# DIS08 / OR92 Data Modeling: Python - Web Scraping

Timo Breuer, Faculty of Information Science and Communication Studies, Institute of Information Management

**Disclaimer:** Some contents are taken from https://carpentries-incubator.github.io/lc-webscraping/

---
## Basics of the Web, HTTP, and HTML
---

### What is the Web?

- A collection of interconnected documents and resources accessed via the internet.
- Web browsers (e.g., Chrome, Firefox, Safari) are used to navigate the web.

### What is HTTP?

**HTTP = HyperText Transfer Protocol:**
- It is a protocol for transferring data between a client (browser) and a server.
  
**Key Features:**
- Request-Response Model
- Stateless (no memory of previous requests).
  
**HTTP Methods:**
- `GET`: Retrieve data.
- `POST`: Send data.
- `PUT`: Update data.
- `DELETE`: Remove data.

Example Request:
```
GET /index.html HTTP/1.1
Host: example.com
```

Example Response:
```
HTTP/1.1 200 OK
Content-Type: text/html
```
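The request-response cycle can be tried out without touching a real website. The following is a minimal sketch that uses only the Python standard library: it starts a throwaway server on the local machine, sends one `GET` request to it, and reads the status line, a header, and the body of the response. The handler class `HelloHandler` and the tiny HTML page it returns are made up for illustration.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: answers every GET request with a tiny HTML page."""

    def do_GET(self):
        body = b"<html><body><h1>Hello</h1></body></html>"
        self.send_response(200)  # status code 200 means OK
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default request logging to keep the output short


# Port 0 asks the operating system for any free port
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: send "GET /index.html" and read the response
conn = HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/index.html")
response = conn.getresponse()
status = response.status
reason = response.reason
content_type = response.getheader("Content-Type")
html = response.read().decode("utf-8")
conn.close()

server.shutdown()
server.server_close()

print(status, reason)
print(content_type)
print(html)
```

Each request is handled independently here, which is what "stateless" means in practice: the server keeps no record of earlier requests unless the application adds one (for example via cookies).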

### What is HTML?

**HTML (HyperText Markup Language):**
- Language for structuring web pages.
- Defines content and layout.

**HTML Elements:**
- Example tags: `<h1>`, `<p>`, `<a>`.
- Example attributes: `class`, `id`, `href`.

**Basic HTML Example:**
```
<html>
  <head>
    <title>My Web Page</title>
  </head>
  <body>
    <h1>Welcome to My Page</h1>
    <p>This is a paragraph.</p>
    <a href="https://example.com">Click Here</a>
  </body>
</html>
```
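To see how a program perceives tags and attributes, the page above can be fed to a parser. This is a minimal sketch using `html.parser` from the Python standard library rather than a scraping library; the class name `TagCollector` is made up for illustration.

```python
from html.parser import HTMLParser

html_doc = """
<html>
  <head>
    <title>My Web Page</title>
  </head>
  <body>
    <h1>Welcome to My Page</h1>
    <p>This is a paragraph.</p>
    <a href="https://example.com">Click Here</a>
  </body>
</html>
"""


class TagCollector(HTMLParser):
    """Records every start tag and every link target it encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        # attrs is a list of (name, value) pairs, e.g. [("href", "https://...")]
        for name, value in attrs:
            if tag == "a" and name == "href":
                self.links.append(value)


collector = TagCollector()
collector.feed(html_doc)

print(collector.tags)   # ['html', 'head', 'title', 'body', 'h1', 'p', 'a']
print(collector.links)  # ['https://example.com']
```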


### How Does a Web Page Load?

**Step-by-Step Process:**
1.	Browser sends an HTTP request to a server.
2.	Server processes the request and sends an HTTP response (usually HTML).
3.	Browser renders the HTML, CSS, and JavaScript to display the page.

[![](https://mermaid.ink/img/pako:eNp1UctuwjAQ_BXLJ5DSNDHkeeBQ-qBVKyGSU5WLG2-TCGKntlOgiH-vk0APVFhaybs7M7v2HHAuGOAYK_hqgedwX9FC0jrjyJyGSl3lVUO5RndSbBXI_40E5HdXHzon2M1sNtRj0-cMLdJ0iVbdDKXRCOzCttDTQ4puK85gZ5e63owHgYFm-Cel-MxVjeAK0GiRvr2iueAauB5fDv0jLak04A57ba9H0HmJNhVfA0MSlGhlDgqN5klioRcTzzUtQF1dq3_XJf3qPiuDBom28NEYVWzhGmRNK2a-_tCRMqxLqCHDsbkyKtcZzvjR4GirRbLnOY61bMHCUrRFeU7ahlF99uxcNK68C2HST7pRJgdWaSHfBp97u3sMjg94h-NgYpPImYau5weTkISuhfc4Jq4duST0AsfzfH_q-ORo4Z9e1bUdhxA_ckjku8HEifzjL6b9u78?type=png)](https://mermaid.live/edit#pako:eNp1UctuwjAQ_BXLJ5DSNDHkeeBQ-qBVKyGSU5WLG2-TCGKntlOgiH-vk0APVFhaybs7M7v2HHAuGOAYK_hqgedwX9FC0jrjyJyGSl3lVUO5RndSbBXI_40E5HdXHzon2M1sNtRj0-cMLdJ0iVbdDKXRCOzCttDTQ4puK85gZ5e63owHgYFm-Cel-MxVjeAK0GiRvr2iueAauB5fDv0jLak04A57ba9H0HmJNhVfA0MSlGhlDgqN5klioRcTzzUtQF1dq3_XJf3qPiuDBom28NEYVWzhGmRNK2a-_tCRMqxLqCHDsbkyKtcZzvjR4GirRbLnOY61bMHCUrRFeU7ahlF99uxcNK68C2HST7pRJgdWaSHfBp97u3sMjg94h-NgYpPImYau5weTkISuhfc4Jq4duST0AsfzfH_q-ORo4Z9e1bUdhxA_ckjku8HEifzjL6b9u78)

---
## What is Web Scraping?
---

**Definition:** Extracting data from websites programmatically.

**Why scrape?**
- Convert non-tabular or poorly structured data into usable formats such as CSV files, spreadsheets, or database-compatible formats.
- Access information not available via APIs.
- Automate data collection for analysis.

**Use Cases:**
- Data Collection: Acquire specific data points from websites.
- Data Archiving: Preserve online data for future reference.
- Change Tracking: Monitor updates or changes to online information.
- Examples:
  - Data for machine learning.
  - Price comparison/Competitive Analysis: Businesses scrape competitor prices to adjust their own.
  - Contact Scraping: Collecting personal information for marketing purposes.
  - Academic Research/Trend Analysis: Creating datasets for text mining and analysis in scholarly projects.
  - Data Journalism: Investigative journalists scrape data for stories, especially when not easily accessible.

---
## Tools for Web Scraping
---

**Languages:** Python (preferred), but other scripting languages such as PHP or Ruby work as well.

**Python Libraries:**
- **Requests:** https://requests.readthedocs.io/en/latest/  
- **Beautiful Soup:** https://beautiful-soup-4.readthedocs.io/
- **Scrapy:** https://scrapy.org/
- **Selenium:** https://selenium-python.readthedocs.io/ 
- **Parsel:** https://parsel.readthedocs.io/en/latest/

**Other Tools:**
- **Browser Developer Tools** for inspection (Shortcut: `F12`)
- **Regex** for string patterns
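
As a small illustration of regular expressions for string patterns, the sketch below pulls values out of a short HTML snippet with Python's `re` module. The snippet and its numbers are made up for illustration. Regular expressions are fragile on full HTML pages, so this approach is best kept for short, predictable fragments, with an HTML parser doing the main work.

```python
import re

# Made-up snippet for illustration
snippet = (
    '<span class="country-population">84000</span> '
    '<span class="country-area">468.0</span>'
)

# Group 1 captures the part of the class name after "country-",
# group 2 captures the text between the opening and closing tag
pattern = re.compile(r'<span class="country-(\w+)">([^<]*)</span>')
matches = pattern.findall(snippet)

print(matches)  # [('population', '84000'), ('area', '468.0')]
```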

---
## A first example
---

Make a local copy of the page and save it as `index.html`. (Later you can keep it in memory without writing it to disk.)

In [5]:
import requests

# Specify the URL of the page you want to download
url = "https://www.scrapethissite.com/pages/simple/"

# Make a GET request to fetch the raw HTML content
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    # Save the content to an HTML file
    with open("index.html", "w", encoding="utf-8") as file:
        file.write(response.text)
    print("Page saved as 'index.html'")
else:
    print(f"Failed to download page. Status code: {response.status_code}")

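If this cell fails with `ModuleNotFoundError: No module named 'requests'`, install the package first (for example with `pip install requests`) and run the cell again.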

Inspect the file and get familiar with the structure.

In [None]:
!cat index.html

**As an alternative you can also use the Developer Tools of your web browser!**

In [None]:
from bs4 import BeautifulSoup

# If the HTML is saved locally, you can use:
with open("index.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, 'html.parser')

# Initialize an empty dictionary to store the country data
countries_dict = {}

# Find all div elements with class 'col-md-4 country'
countries = soup.find_all('div', class_='col-md-4 country')

# Iterate over each country div
for country in countries:
    # Extract country name
    name = country.find('h3', class_='country-name').get_text(strip=True).split('\n')[-1].strip()
    
    # Extract capital, population, and area
    capital = country.find('span', class_='country-capital').get_text(strip=True)
    population = country.find('span', class_='country-population').get_text(strip=True)
    area = country.find('span', class_='country-area').get_text(strip=True)

    # Store the data in the dictionary
    countries_dict[name] = {
        'Capital': capital,
        'Population': population,
        'Area (km^2)': area
    }

# Print the resulting dictionary
for country, data in countries_dict.items():
    print(f"{country}: {data}")
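
A dictionary shaped like `countries_dict` can be written to CSV with the standard library. The following is a minimal sketch; the helper name `dict_to_csv` and the two sample entries are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Illustrative sample in the same shape as countries_dict
sample = {
    "Andorra": {"Capital": "Andorra la Vella", "Population": "84000", "Area (km^2)": "468.0"},
    "Exampleland": {"Capital": "Sample City", "Population": "12345", "Area (km^2)": "67.8"},
}


def dict_to_csv(countries, out):
    """Write one CSV row per country to the file-like object `out`."""
    fieldnames = ["Country", "Capital", "Population", "Area (km^2)"]
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for name, data in countries.items():
        # Merge the country name with its nested values into one flat row
        writer.writerow({"Country": name, **data})


buffer = io.StringIO()
dict_to_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open("countries.csv", "w", newline="", encoding="utf-8")` in place of the buffer.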

---
## Don’t break the web: Denial of Service attacks
---

**Web Scraping Basics:**
- Web scraping involves repeatedly querying a website and accessing multiple pages.
- Each request to a web server consumes its resources, potentially impacting other users.

**Risks of Excessive Requests:**
- Sending too many requests in a short time can overload a server.
- Overloading may block legitimate users or crash the server.
- Hackers exploit this technique intentionally in Denial of Service (DoS) attacks.

**Server Protections Against Abuse:**
- Servers monitor for excessive requests from a single IP address.
- They may block or ban the source to prevent misuse.

**Challenges for Scrapers:**
- Legitimate web scrapers can inadvertently mimic DoS attacks.
- This may result in being banned from the website.

**Preventive Measures:**
- Insert random delays between requests to the server.
- Example: Scrapy includes built-in safeguards like random delays between requests to avoid overloading servers. By default, Scrapy limits the risk of resembling a DoS attack.
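
Without a framework, random delays can be added by hand. The sketch below is one possible way to do it, not how Scrapy implements it: the function name `fetch_politely` and its parameters are made up for illustration, and the demo passes stand-ins for the download function and for `time.sleep` so that no real server is contacted.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=1.0, max_delay=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random time between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait between requests, but not before the first one
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demo with stand-ins: the "fetch" just builds a string,
# and the "sleep" records the delay instead of waiting
recorded_delays = []
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: f"content of {url}",
    sleep=recorded_delays.append,
)

print(pages)
print(recorded_delays)
```

In a real scraper, `fetch` could be something like `lambda url: requests.get(url).text`, with the default `sleep` left in place.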

**Best Practices for Developers:**
- Limit the number of pages scraped during development and debugging.
- Use Scrapy’s allowed_domains property to restrict scraping to specific domains.
- Avoid scraping at full scale until the scraper is thoroughly tested.
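
The first two practices can also be applied in plain Python. The following sketch is only an analogue of a domain restriction, not Scrapy's implementation; the helper `is_allowed`, the constant `MAX_PAGES`, and the URLs are made up for illustration.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = ["scrapethissite.com"]
MAX_PAGES = 2  # keep runs small while developing and debugging


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


candidate_urls = [
    "https://www.scrapethissite.com/pages/simple/",
    "https://example.com/elsewhere",
    "https://www.scrapethissite.com/pages/forms/",
    "https://www.scrapethissite.com/pages/",
]

# Drop foreign domains first, then cap the number of pages
to_scrape = [u for u in candidate_urls if is_allowed(u, ALLOWED_DOMAINS)][:MAX_PAGES]
print(to_scrape)
```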

**In summary:** With proper precautions, the risk of causing trouble is minimal! (Learn to play by the rules)

_Source:_ https://carpentries-incubator.github.io/lc-webscraping/05-conclusion/index.html

---
## Don’t steal: Copyright and fair use
--- 

**Legal Considerations of Web Scraping:**
- Scraping may be illegal if a website’s terms and conditions explicitly prohibit copying its content.
- Violating such terms could lead to legal trouble.
  
**General Acceptance:**
- Web scraping is often tolerated if it does not disrupt the regular use of a website.
- It is comparable to using a browser to access publicly available web content.

**Public vs. Protected Data:**
- Scraping publicly available data (not behind authentication or paywalls) is generally acceptable.
- Problems arise when scraped data is redistributed or shared inappropriately.

**Copyright Concerns:**
- Republishing scraped content, especially without permission, can constitute copyright infringement.
- Simply posting content from one site onto another as one’s own is illegal.

**Fair Use Exceptions:**
- Using scraped data in aggregate or derivative formats may qualify as “fair use.”
- Avoid passing off scraped data as original, copying it verbatim, or monetizing it without permission.

**Legal Variability:**
- Copyright and data privacy laws vary by country; check the laws applicable to your location.
- Example: In Australia, scraping and storing personal information (e.g., names, phone numbers, email addresses) can be illegal, even if publicly available.

**Personal vs. Large-Scale Use:**
- For personal use, following general scraping guidelines is usually sufficient.
- For large-scale research or commercial projects, seek legal advice beforehand.

**University Resources:**
- Universities often have a copyright office to assist with legal aspects of data scraping projects.
- The university library is a good starting point for guidance on copyright issues.

**Best Practices:** Scrape responsibly, respect terms of use, and ensure compliance with copyright laws.

_Source:_ https://carpentries-incubator.github.io/lc-webscraping/05-conclusion/index.html

---
## robots.txt
---

**What is robots.txt?**
- A file placed at the root of a website (example.com/robots.txt) to communicate with web crawlers and bots.
- It specifies which parts of the website can or cannot be accessed by automated programs.

**Purpose of robots.txt:**
- Helps manage server load by restricting unnecessary crawling.
- Protects sensitive or non-public sections of a site from being indexed.
- Provides guidelines for ethical web scraping and crawling.

**How robots.txt Works:**
- Uses directives to instruct bots (e.g., User-agent: * applies to all bots).

**Common directives include:**
- `Disallow`: specifies paths that bots should not crawl.
- `Allow`: permits access to specific paths.
- `Sitemap`: points bots to the website’s sitemap for better indexing.

**Limitations of robots.txt:**
- It is advisory, not enforceable; malicious or non-compliant bots can ignore it.
- Does not provide security; sensitive content should be protected by other means (e.g., authentication).

**Best Practices for Web Scraping:**
- Always check a website’s robots.txt file before scraping.
- Respect the rules specified in the file to avoid violating ethical or legal standards.
- Use tools or libraries (e.g., Python’s `urllib.robotparser` module) to parse and adhere to robots.txt: https://docs.python.org/3/library/urllib.robotparser.html
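
A minimal sketch of checking rules with `urllib.robotparser` from the standard library follows. The robots.txt content, the user agent name `MyScraper`, and the URLs are made up for illustration; the rules are parsed from a string so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /public/
Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

public_ok = parser.can_fetch("MyScraper", "https://example.com/public/index.html")
private_ok = parser.can_fetch("MyScraper", "https://example.com/private/data.html")
sitemaps = parser.site_maps()  # available in Python 3.8 and later

print(public_ok)   # True
print(private_ok)  # False
print(sitemaps)    # ['https://example.com/sitemap.xml']
```

For a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` downloads and parses the file instead.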

**Locating robots.txt:**
- Visit the URL https://[website-domain]/robots.txt to view its contents.
- Note that not all websites have a robots.txt file.

_Source:_ https://carpentries-incubator.github.io/lc-webscraping/05-conclusion/index.html

---
## Web scraping code of conduct
---

If you adhere to the following simple rules, you will probably be fine.

1. **Ask nicely.** If your project requires data from a particular organisation, for example, you can try asking them directly if they could provide you with what you are looking for. With some luck, they will have the primary data that they used on their website in a structured format, saving you the trouble.
2. **Don’t download copies of documents that are clearly not public.** For example, academic journal publishers often have very strict rules about what you can and what you cannot do with their databases. Mass downloading article PDFs is probably prohibited and can put you (or at the very least your friendly university librarian) in trouble. If your project requires local copies of documents (e.g. for text mining projects), special agreements can be reached with the publisher. The library is a good place to start investigating something like that.
3. **Check your local legislation.** For example, certain countries have laws protecting personal information such as email addresses and phone numbers. Scraping such information, even from publicly available websites, can be illegal (e.g. in Australia).
4. **Don’t share downloaded content illegally.** Scraping for personal purposes is usually OK, even if it is copyrighted information, as it could fall under the fair use provision of the intellectual property legislation. However, sharing data for which you don’t hold the right to share is illegal.
5. **Share what you can.** If the data you scraped is in the public domain or you got permission to share it, then put it out there for other people to reuse it (e.g. on datahub.io). If you wrote a web scraper to access it, share its code (e.g. on GitHub) so that others can benefit from it.
6. **Don’t break the Internet.** Not all web sites are designed to withstand thousands of requests per second. If you are writing a recursive scraper (i.e. that follows hyperlinks), test it on a smaller dataset first to make sure it does what it is supposed to do. Adjust the settings of your scraper to allow for a delay between requests. By default, Scrapy uses conservative settings that should minimize this risk.
7. **Publish your own data in a reusable way.** Don’t force others to write their own scrapers to get at your data. Use open and software-agnostic formats (e.g. JSON, XML), provide metadata (data about your data: where it came from, what it represents, how to use it, etc.) and make sure it can be indexed by search engines so that people can find it.

_Source:_ https://carpentries-incubator.github.io/lc-webscraping/05-conclusion/index.html

---
## Video recommendations 
---

[David Kriesel - SpiegelMining – Reverse Engineering von Spiegel-Online @ 33c3](https://media.ccc.de/v/33c3-7912-spiegelmining_reverse_engineering_von_spiegel-online)

<iframe width="512" height="288" src="https://media.ccc.de/v/33c3-7912-spiegelmining_reverse_engineering_von_spiegel-online/oembed" frameborder="0" allowfullscreen></iframe>

[David Kriesel - BahnMining - Pünktlichkeit ist eine Zier @ 36C3](https://media.ccc.de/v/36c3-10652-bahnmining_-_punktlichkeit_ist_eine_zier)

<iframe width="512" height="288" src="https://media.ccc.de/v/36c3-10652-bahnmining_-_punktlichkeit_ist_eine_zier/oembed" frameborder="0" allowfullscreen></iframe>

---
## Further reading
---

https://en.wikipedia.org/wiki/Data_journalism  
https://en.wikipedia.org/wiki/Web_scraping

---
## Key points
---

- Web scraping is, in general, legal and won’t get you into trouble.

- There are a few things to be careful about, notably don’t overwhelm a web server and don’t steal content.

- Be nice. When in doubt, ask.

_Source:_ https://carpentries-incubator.github.io/lc-webscraping/05-conclusion/index.html

---
# Lab assignment
---

## Scraping and evaluating NHL team stats since 1990

Open this site to browse through a database of NHL team stats since 1990:

https://www.scrapethissite.com/pages/forms/

Luckily, the data is already provided in tabular format. However, not all of the data is shown on a single page. Instead, you will have to paginate through the database. Your task is to find a way to scrape all of the pages automatically and fetch the complete database. Then write everything into a CSV file. The column names should correspond to those in the HTML table. Afterwards, load the CSV file with pandas and answer the following two questions:

- Which team had the most wins in 1990, 2000, and 2010?
- How many teams participated in 1991, 2001, and 2011?

**In summary:**
1. Scrape the data (find a way to obtain the data from all pages programmatically)
2. Store the entire dataset as a CSV file
3. Load it with pandas
4. Answer the questions above by making use of the DataFrame methods

As in the example above, we recommend using **requests** and **beautifulsoup**.
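
One possible shape for the pagination loop is sketched below. The function name `scrape_all_pages` and the fake pages are made up for illustration. In a real scraper, `fetch_rows` would request the page with the given `page_num` query parameter and return the parsed table rows. The stopping rule assumes that a page past the end yields no rows, which should be checked against the actual site.

```python
def scrape_all_pages(fetch_rows, max_pages=100):
    """Collect rows page by page until a page comes back empty."""
    all_rows = []
    for page_num in range(1, max_pages + 1):
        rows = fetch_rows(page_num)
        if not rows:
            break  # an empty page is taken to mean we ran past the last one
        all_rows.extend(rows)
    return all_rows


# Stand-in for the real site: three fake pages of (team, year) rows
fake_pages = {
    1: [("Team A", 1990), ("Team B", 1990)],
    2: [("Team A", 1991), ("Team B", 1991)],
    3: [("Team A", 1992)],
}

rows = scrape_all_pages(lambda page_num: fake_pages.get(page_num, []))
print(len(rows))  # 5
```

The `max_pages` cap is a safety net so that a faulty stopping condition cannot turn into an endless stream of requests.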

In [16]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.scrapethissite.com/pages/forms/?page_num=1&per_page=1000"

response = requests.get(url)

if response.status_code == 200:
    # Save the content to an HTML file
    with open("page.html", "w", encoding="utf-8") as file:
        file.write(response.text)
    print("Page saved as 'page.html'")
else:
    print(f"Failed to download page. Status code: {response.status_code}")

Page saved as 'page.html'


In [17]:
import pandas as pd
from bs4 import BeautifulSoup

# Load the local HTML file
with open("page.html", "r", encoding="utf-8") as file:
    html_content = file.read()

# Create the BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

# Find the table
table = soup.find('table', {'class': 'table'})

# Stop if the table does not exist
if not table:
    raise SystemExit("No table found in the HTML file.")

# Extract the table rows
rows = table.find_all('tr')

# Extract the column headers
headers = [header.text.strip() for header in rows[0].find_all('th')]

# Extract the data rows
data = []
for row in rows[1:]:  # skip the header row
    columns = [col.text.strip() for col in row.find_all('td')]
    if columns:
        data.append(columns)

# Create the DataFrame
df = pd.DataFrame(data, columns=headers)

# Save the DataFrame to a CSV file
csv_filename = "extracted_data.csv"
df.to_csv(csv_filename, index=False)
print(f"Data successfully saved to '{csv_filename}'.")


Data successfully saved to 'extracted_data.csv'.


When you have completed the tasks, please commit this notebook with the solution to your GitHub repository in the directory `assignments/08/`.

In [19]:
import pandas as pd

# Load the CSV file
df = pd.read_csv('extracted_data.csv')

# Show the first rows to check the data
print(df.head())


            Team Name  Year  Wins  Losses  OT Losses  Win %  Goals For (GF)  \
0       Boston Bruins  1990    44      24        NaN  0.550             299   
1      Buffalo Sabres  1990    31      30        NaN  0.388             292   
2      Calgary Flames  1990    46      26        NaN  0.575             344   
3  Chicago Blackhawks  1990    49      23        NaN  0.613             284   
4   Detroit Red Wings  1990    34      38        NaN  0.425             273   

   Goals Against (GA)  + / -  
0                 264     35  
1                 278     14  
2                 263     81  
3                 211     73  
4                 298    -25  


In [28]:
import pandas as pd

# Load the CSV file
df = pd.read_csv('extracted_data.csv')

# Clean up the column names
df.columns = df.columns.str.strip()  # remove surrounding whitespace
df.columns = df.columns.str.lower()  # convert all column names to lowercase

# The years we want to analyze
years = [1990, 2000, 2010]

# Store the results
most_wins = {}

for year in years:
    # Filter by year and sort by 'wins'
    year_data = df[df['year'] == year]
    if not year_data.empty:
        # Team with the most wins
        top_team = year_data.sort_values(by='wins', ascending=False).iloc[0]
        most_wins[year] = {'Team': top_team['team name'], 'Wins': top_team['wins']}
    else:
        print(f"No data found for the year {year}.")

# Show the results
for year, stats in most_wins.items():
    print(f"In {year}, {stats['Team']} had the most wins ({stats['Wins']})")



In 1990, Chicago Blackhawks had the most wins (49)
In 2000, Colorado Avalanche had the most wins (52)
In 2010, Vancouver Canucks had the most wins (54)


In [29]:
# The years we want to analyze
years_participation = [1991, 2001, 2011]

# Store the results
team_counts = {}

for year in years_participation:
    # Filter by year
    year_data = df[df['year'] == year]
    if not year_data.empty:
        # Count the unique teams
        team_counts[year] = year_data['team name'].nunique()
    else:
        print(f"No data found for the year {year}.")

# Show the results
for year, count in team_counts.items():
    print(f"In {year}, {count} teams participated")


In 1991, 22 teams participated
In 2001, 30 teams participated
In 2011, 30 teams participated
