<center>🌐 Web Data Collection Workshop</center>
<center>Scraping Static and Dynamic Sites with BeautifulSoup and Selenium</center>

### 📚 In this workshop you will learn:
**Static Websites**: Data extraction with `BeautifulSoup` and `requests`  
**Dynamic Websites**: Data extraction with `Selenium`  
**APIs**: Retrieving and sending data  
**Authentication**: Accessing protected pages  

        
### 🔗 Resources:
- [BeautifulSoup Documentation](https://beautiful-soup-4.readthedocs.io/en/latest/)
- [Selenium Documentation](https://selenium-python.readthedocs.io/)

## 📦 Installing Required Libraries

In [1]:
# Install required libraries
!pip install requests beautifulsoup4 tqdm selenium



# 🔍 Part 1: Extracting Data from Static Websites with BeautifulSoup

## Sending HTTP Requests with requests

In [2]:
import requests

# Define the website URL
URL = "https://www.varzesh3.com/"

# Send GET request
page = requests.get(URL)
# For sites that require HTTP Basic authentication: page = requests.get(URL, auth=('user', 'pass'))

# Display the first 500 characters of the text response
print("📄 Page text content (first 500 characters):")
print(page.text[:500])
# Display the first 100 bytes of the binary response
print("💾 Page binary content (first 100 bytes):")
print(page.content[:100])

📄 Page text content (first 500 characters):



<!DOCTYPE html>
<html lang="fa" prefix="og: http://ogp.me/ns#">
<head>
    <title>مرجع فوتبال و ورزش | ورزش سه</title>
    <meta charset="utf-8" />
    
    <meta name="viewport" content="width=1170" />
    
    <meta name="description" content="پایگاه اطلاع رسانی ورزشی برای فارسی زبانان كه اخبار حوزه ورزش (فوتبال،والیبال ،بسكتبال و...)ونتایج بازیها و جداول لیگ های ورزشی را بصورت زنده ارائه می کند" />





<link href="https://www.varzesh3.com/" rel="canonical" />

<meta pro
💾 Page binary content (first 100 bytes):
b'\r\n\r\n\r\n<!DOCTYPE html>\r\n<html lang="fa" prefix="og: http://ogp.me/ns#">\r\n<head>\r\n    <title>\xd9\x85\xd8\xb1\xd8\xac\xd8\xb9 '
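The two outputs above show the same response at two levels: `page.content` holds the raw bytes, and `page.text` is those bytes decoded into a string. The following is a minimal sketch using only the standard library; the byte sequence is copied from the binary output above, and the variable names are illustrative.

```python
# Bytes copied from the binary output above (the first word of the <title>).
raw = b"\xd9\x85\xd8\xb1\xd8\xac\xd8\xb9"

# Decoding as UTF-8 turns the 8 bytes into 4 Persian characters.
decoded = raw.decode("utf-8")
print(decoded)
print(len(raw), len(decoded))

# Encoding the string again gives back the original bytes.
assert decoded.encode("utf-8") == raw
```

If `page.text` ever shows garbled characters, `requests` may have guessed the wrong encoding; setting `page.encoding = "utf-8"` before reading `page.text` is a common remedy.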


## Parsing HTML with BeautifulSoup
We'll start with a simple example to get familiar with BeautifulSoup's features:

In [3]:
# A simple HTML example
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

### Converting HTML text to a BeautifulSoup object

In [4]:
from bs4 import BeautifulSoup
# Create BeautifulSoup object from HTML text
soup = BeautifulSoup(html_doc, 'html.parser')

# Display HTML in a readable format
print("🔍 Prettified HTML:")
print(soup.prettify())

🔍 Prettified HTML:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>



### Extracting and Manipulating HTML Elements
#### 1. Accessing Elements

In [9]:
# Title element
print(f"🏷️ Title element: {soup.title} ")
print(f"📝 Element name: {soup.title.name}")
print(f"📄 Element text: {soup.title.string}")
print(f"👪 Parent element name: {soup.title.parent.name}")

🏷️ Title element: <title>The Dormouse's story</title> 
📝 Element name: title
📄 Element text: The Dormouse's story
👪 Parent element name: head


In [10]:
# First p element
print(f"📌 First p element: {soup.p}")
print(f"🔖 p element class: {soup.p['class']}")

📌 First p element: <p class="title"><b>The Dormouse's story</b></p>
🔖 p element class: ['title']


#### 2. Navigating Between Elements

In [11]:
soup.a

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

In [12]:
soup.find_all('a')

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In [13]:
soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

In [14]:
# Access all links
print(" 🔗 All links:")
for link in soup.find_all('a'):
    print(f"-{link.get('href')}")
# Extract all text
print("📝 All text:")
print(soup.get_text())

 🔗 All links:
-http://example.com/elsie
-http://example.com/lacie
-http://example.com/tillie
📝 All text:

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...



#### 3. Working with Element Attributes

In [8]:
# Different ways to access element attributes
print(f"🔖 p class using []: {soup.p['class']}")
print(f"🔖 p class using get(): {soup.p.get('class')}")
print(f"📦 All p attributes: {soup.p.attrs}")

🔖 p class using []: ['title']
🔖 p class using get(): ['title']
📦 All p attributes: {'class': ['title']}
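The two access styles above behave differently when an attribute is missing. Here is a small sketch on a made-up tag (the HTML string and the names `soup_demo`, `missing`, `fallback` are illustrative, not part of the workshop data):

```python
from bs4 import BeautifulSoup

soup_demo = BeautifulSoup('<p class="title big" id="intro">Hello</p>', "html.parser")
p = soup_demo.p

# class is a multi-valued attribute, so it comes back as a list; id is a plain string.
print(p["class"])
print(p["id"])

# .get() returns None (or a supplied default) when the attribute is missing ...
missing = p.get("style")
fallback = p.get("style", "no style")

# ... whereas [] raises KeyError.
try:
    p["style"]
    raised = False
except KeyError:
    raised = True
print(missing, fallback, raised)
```

When scraping real pages, where attributes are often absent, `.get()` is usually the safer choice.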


#### 4. Accessing Text Content

In [16]:
# Text inside p element
print(f"📄 Text inside p element: {soup.p.string}")
# Text inside first link
print(f"📄 Text inside first link: {soup.a.string}")

📄 Text inside p element: The Dormouse's story
📄 Text inside first link: Elsie
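`.string` works above because the first `<p>` contains a single `<b>`, which in turn contains a single piece of text. When a tag has several children, `.string` is `None` and `get_text()` is the usual alternative. A minimal sketch on an illustrative snippet (not the workshop page):

```python
from bs4 import BeautifulSoup

snippet = '<p class="story">Sisters: <a href="#">Elsie</a> and <a href="#">Lacie</a></p>'
story = BeautifulSoup(snippet, "html.parser").p

# .string is None because <p> has several children (text nodes and <a> tags).
print(story.string)

# get_text() concatenates the text of all descendants.
print(story.get_text())

# A tag with a single text child still has a usable .string.
print(story.a.string)
```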


#### 5. Navigating the HTML Tree

In [15]:
# Body contents
print("👨‍👩‍👧‍👦 Body element contents:")
for i, item in enumerate(soup.body.contents):
    print(f"  {i}: {item}")
print(f"🔢 Number of elements inside body: {len(soup.body.contents)}")

👨‍👩‍👧‍👦 Body element contents:
  0: 

  1: <p class="title"><b>The Dormouse's story</b></p>
  2: 

  3: <p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
  4: 

  5: <p class="story">...</p>
  6: 

🔢 Number of elements inside body: 7
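Four of the seven items above are just the newline strings between tags. If only the child tags are of interest, they can be filtered out. The sketch below shows two possible ways on a shortened, illustrative document:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

demo = BeautifulSoup("<body>\n<p>one</p>\n<p>two</p>\n</body>", "html.parser")

# .contents keeps the newline strings that sit between the tags.
print(len(demo.body.contents))

# Option 1: keep only Tag objects.
tags_only = [child for child in demo.body.contents if isinstance(child, Tag)]

# Option 2: ask find_all for direct child tags only (True matches any tag name).
direct_children = demo.body.find_all(True, recursive=False)

print([t.get_text() for t in tags_only])
print([t.get_text() for t in direct_children])
```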


### Searching in HTML
BeautifulSoup offers various methods for searching HTML:

In [18]:
# Recreate soup to be safe
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

# Search by tag name
print("🔍 Search for all b elements:")
print(soup.find_all('b'))

🔍 Search for all b elements:
[<b>The Dormouse's story</b>]


In [19]:
# Search using regular expression
import re
print("🔍 Search for all elements with names starting with 'b':")
for tag in soup.find_all(re.compile("^b")):
    print(f"  {tag}")

# Search for multiple element types
print("🔍 Search for all a and b elements:")
print(soup.find_all(["a", "b"]))

🔍 Search for all elements with names starting with 'b':
  <body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>
  <b>The Dormouse's story</b>
🔍 Search for all a and b elements:
[<b>The Dormouse's story</b>, <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]


### Advanced Search with Element Attributes

In [26]:
# Search by combining name and attributes
print("🔍 Search for all a elements with id='link1' and class='sister':")
result = soup.find_all(name="a", id="link1", class_="sister")
for item in result:
    print(f"  {item}")
# Search for the first matching element
print("🔍 Search for the first a element:")
print(soup.find("a"))

🔍 Search for all a elements with id='link1' and class='sister':
  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
🔍 Search for the first a element:
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
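The same kind of combined search can also be written as a CSS selector with `select()` and `select_one()`, which some readers may find more compact. A short sketch on a trimmed copy of the example document (the name `css_soup` is illustrative):

```python
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
  <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
  <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
css_soup = BeautifulSoup(html, "html.parser")

# All <a class="sister"> elements nested inside <p class="story">.
links = css_soup.select("p.story a.sister")
hrefs = [a["href"] for a in links]
print(hrefs)

# select_one() returns the first match, or None if nothing matches.
first = css_soup.select_one("a#link2")
nothing = css_soup.select_one("a#link9")
print(first.get_text(), nothing)
```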


## 💻 Practical Example 1: Extracting News Text from a Sports Website

In [40]:
import requests
from bs4 import BeautifulSoup
# URL of a news article (replace with your chosen URL)
url = "https://www.varzesh3.com/news/2113453/%D8%AF%D8%B1%D9%88%D8%A7%D8%B2%D9%87-%D8%A8%D8%A7%D9%86-%D8%AC%D8%AF%DB%8C%D8%AF-%D9%85%D9%84%DB%8C-%D9%BE%D9%88%D8%B4-%D8%A7%D8%B2-%D8%AA%DB%8C%D9%85-%D8%AD%D8%B3%DB%8C%D9%86%DB%8C-%D8%A7%D9%93%D9%85%D8%AF"
# Send GET request
response = requests.get(url)
# Parse HTML
soup = BeautifulSoup(response.text, "html.parser")
# Find all paragraphs with the specific style attribute
paragraphs = soup.find_all("p", style="text-align:justify;")
# Extract and join the text of all paragraphs
news_text = "\n".join(p.text.strip() for p in paragraphs)
# Print the extracted news text
print("📰 Extracted News Content:")
print(news_text)

📰 Extracted News Content:
به گزارش "ورزش سه"، فهرست جدید تیم ملی فوتبال ایران در حالی از سوی امیر قلعه‌نویی اعلام شد که نام محمد خلیفه، دروازه‌بان ۲۰ ساله آلومینیوم اراک هم در آن دیده می‌شود.
محمد خلیفه که در نقل و انتقالات تابستانی ۱۴۰۲ با نظر سیدمجتبی حسینی به آلومینیوم اراک پیوست و فرصت درخشش در لیگ برتر زیر نظر این سرمربی را به دست آورد، حالا در جمع ملی فوتبال ایران برای دو مسابقه مقدماتی جام جهانی ۲۰۲۶ مقابل امارات و ازبکستان قرار گرفته است.
این دروازه‌بان در فصل گذشته لیگ برتر مقابل نساجی مازندران، استقلال تهران و فولاد خوزستان فرصت بازی به دست آورد و در فصل جاری هم غیر از یک مسابقه که به‌خاطر محرومیت حضور نداشت، در ۲۲ بازی دیگر همواره در ترکیب بوده و نمایش درخشانی ارائه داده است.
خلیفه در ۲۲ مسابقه فصل جاری لیگ برتر توانسته ۸ کلین‌شیت به ثبت رساند و امشب هم علی‌رغم دریافت دو گل مقابل فولاد خوزستان نمایش خوبی داشت تا برای نخستین بار شانس دعوت شدن به تیم ملی را به دست بیاورد.
در فصل جاری جوانان پرشماری با نظر مجتبی حسینی فرصت خودنمایی در‌ آلومینیوم اراک را به‌دست آوردند و محمد خلی

# 🤖 Part 2: Web Automation with Selenium
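On dynamic websites, content is loaded by JavaScript after the initial HTML arrives, so `requests` alone does not see it. Selenium drives a real browser, which runs the JavaScript and exposes the rendered page.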

## Installing Selenium and Setting Up the WebDriver

In [20]:
# Install Selenium if not already installed
!pip install selenium



In [23]:
# Selenium 4.6+ ships with Selenium Manager, which normally downloads a matching
# ChromeDriver automatically. On older versions, download it manually from:
# https://sites.google.com/chromium.org/driver/

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
import time

# Initialize the Chrome driver
driver = webdriver.Chrome()

In [24]:
# Open Google's homepage
driver.get("http://www.google.com")

In [25]:
# Close the current browser window
# (driver.quit() would also end the whole WebDriver session)
driver.close()

In [26]:
# Reopen browser and navigate to Google
driver = webdriver.Chrome()
driver.get("http://www.google.com")

In [27]:
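# Find the search box element by its ID
# (Google's generated IDs and class names can change, so this may need updating)
element = driver.find_element(By.ID, "APjFqb")

# Alternative ways to find elements:
# element = driver.find_element(By.NAME, "q")  # Google's search box uses name="q"
# Generic examples for a hypothetical password field:
# element = driver.find_element(By.XPATH, "//input[@id='passwd-id']")
# element = driver.find_element(By.CSS_SELECTOR, "input#passwd-id")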

In [28]:
# Type "NLP" into the search box
element.send_keys("NLP")

In [29]:
# Find the search button by its class name and click it
element = driver.find_element(By.CLASS_NAME, "gNO89b")
element.click()

In [30]:
# Navigate to a Google Form
driver.get('https://docs.google.com/forms/d/e/1FAIpQLSe0sa36n5-td7G-r0WP2ggHLStW__xOZKFQYypZs04GN3fvgA/viewform?usp=headerg')

In [31]:
# Find radio button options and click each one with a delay
element = driver.find_element(By.XPATH, "/html/body/div/div[2]/form/div[2]/div/div[2]")

all_options = element.find_elements(By.TAG_NAME, "label")
for option in all_options:
    option.click()
    time.sleep(2)

In [32]:
# Find a text input field by its XPATH
element = driver.find_element(By.XPATH, "/html/body/div/div[2]/form/div[2]/div/div[2]/div[1]/div/div/div[2]/div/div[1]/div/div[1]/input")

In [33]:
# Type a name into the input field
element.send_keys("Amirreza")

In [34]:
# List of all Selenium locator strategies
# ID = "id"
# NAME = "name"
# XPATH = "xpath"
# LINK_TEXT = "link text"
# PARTIAL_LINK_TEXT = "partial link text"
# TAG_NAME = "tag name"
# CLASS_NAME = "class name"
# CSS_SELECTOR = "css selector"

In [35]:
# Close the current browser window
# (driver.quit() would also end the whole WebDriver session)
driver.close()

# Part 3: Retrieving and Sending Data with APIs

In [36]:
# API Requests

import requests

# GET request example
api_url = "https://jsonplaceholder.typicode.com/todos/1"
response = requests.get(api_url)
response.json()

{'userId': 1, 'id': 1, 'title': 'delectus aut autem', 'completed': False}

In [37]:
# POST request example
import requests

api_url = "https://jsonplaceholder.typicode.com/todos"
todo = {"userId": 1, "title": "Buy milk", "completed": False}
response = requests.post(api_url, json=todo)
response.json()

{'userId': 1, 'title': 'Buy milk', 'completed': False, 'id': 201}
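
To see what the `json=` argument in the POST example does, the request can be built without being sent. This sketch uses `requests.Request(...).prepare()` for inspection only; nothing goes over the network.

```python
import json

import requests

todo = {"userId": 1, "title": "Buy milk", "completed": False}

# Build the request without sending it, so the outgoing data can be inspected.
prepared = requests.Request(
    "POST", "https://jsonplaceholder.typicode.com/todos", json=todo
).prepare()

# json= serializes the dict and sets the Content-Type header.
print(prepared.headers["Content-Type"])
print(prepared.body)

# The body is plain JSON, so it round-trips back to the original dict.
print(json.loads(prepared.body) == todo)
```

For real calls, passing a `timeout=` value and calling `response.raise_for_status()` before `response.json()` helps surface network and HTTP errors early.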

# 💻 Practical Example 2: Completing a Google Form with Selenium
Complete the following form using the concepts learned in class.

In [38]:
# Example 2: Complete a Google Form using Selenium

driver = webdriver.Chrome()
driver.get("https://docs.google.com/forms/d/e/1FAIpQLSe0sa36n5-td7G-r0WP2ggHLStW__xOZKFQYypZs04GN3fvgA/viewform?usp=headerg")

In [41]:
# Enter first name
element = driver.find_element(By.XPATH, "/html/body/div/div[2]/form/div[2]/div/div[2]/div[1]/div/div/div[2]/div/div[1]/div/div[1]/input")
element.send_keys("Amirreza")

# Enter student number
element = driver.find_element(By.XPATH, "/html/body/div/div[2]/form/div[2]/div/div[2]/div[2]/div/div/div[2]/div/div[1]/div/div[1]/input")
element.send_keys("123456789")

# Enter email address
element = driver.find_element(By.XPATH, "/html/body/div/div[2]/form/div[2]/div/div[2]/div[3]/div/div/div[2]/div/div[1]/div/div[1]/input")
element.send_keys("amirrezahosseinymehr@gmail.com")

# Select radio button
element = driver.find_element(By.XPATH, "/html/body/div/div[2]/form/div[2]/div/div[2]/div[4]/div/div/div[2]/div/div/span/div/div[1]/label")
element.click()

# Enter text in textarea (news_text comes from Practical Example 1)
element = driver.find_element(By.XPATH, "/html/body/div/div[2]/form/div[2]/div/div[2]/div[5]/div/div/div[2]/div/div[1]/div[2]/textarea")
element.send_keys(news_text)

In [42]:
# Click submit button
element = driver.find_element(By.XPATH, "/html/body/div/div[2]/form/div[2]/div/div[3]/div[1]/div[1]/div/span")
element.click()

# More Examples: Downloading Files from a Password-Protected Page

In [44]:
from tqdm import tqdm
import requests
from bs4 import BeautifulSoup
import os

LOG_FILE = 'download_log.txt'

# Load the names of lectures that were already downloaded (if any)
lecture_names = []
if os.path.exists(LOG_FILE):
    with open(LOG_FILE, 'r') as f:
        lecture_names = [line.strip() for line in f if line.strip()]

# Keep credentials out of the URL and pass them through `auth` instead
base_url = 'https://language.ml/courses/nlp14032/'

user = "sharif1400"
pass_ = "sharif1400"

page = requests.get(base_url + 'index.html', auth=(user, pass_))

soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('tbody')
trs = table.find_all('tr')

for tr in tqdm(trs):
    tds = tr.find_all('td')
    name = tds[0].string
    print(name)
    print('-'*10)

    # Skip rows without a name and lectures that are already in the log
    if name is None or name in lecture_names:
        continue

    lecture_names.append(name)
    # Column 2 links to the slides (PDF), column 3 to the video (MP4)
    for td, extension in ((tds[2], '.pdf'), (tds[3], '.mp4')):
        if td.a is not None:
            file_url = base_url + td.a.get('href')
            print(file_url)
            r = requests.get(file_url, auth=(user, pass_))
            with open(name + extension, 'wb') as f:
                f.write(r.content)

# Save the updated list of downloaded lectures
with open(LOG_FILE, 'w') as f:
    f.write('\n'.join(lecture_names))


  0%|                                                                                            | 0/7 [00:00<?, ?it/s]

W1-1
----------
https://language.ml/courses/nlp14032/slides/slide-1.pdf
https://language.ml/courses/nlp14032/videos/01.mp4


  0%|                                                                                            | 0/7 [00:15<?, ?it/s]


KeyboardInterrupt: 
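
The `auth=(user, pass_)` argument used above asks `requests` to apply HTTP Basic authentication, which amounts to sending an `Authorization` header with the Base64-encoded credentials. The sketch below rebuilds that header by hand with the standard library; `basic_auth_header` is a hypothetical helper, and the credentials are placeholders.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Build the Authorization header value for HTTP Basic authentication."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


# Placeholder credentials, not the ones used above.
header = basic_auth_header("user", "pass")
print(header)
```

Base64 is an encoding, not encryption, so Basic authentication should only be used over HTTPS.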

In [1]:
## Install Scrapy
!pip install scrapy
## Check installation
import scrapy
print(scrapy.__version__)

2.8.0


In [2]:
## Define the spider and run it inside a Jupyter notebook
import scrapy
from scrapy.crawler import CrawlerProcess

class Varzesh3Spider(scrapy.Spider):
    name = "varzesh3"
    start_urls = [
        'https://www.varzesh3.com/'
    ]

    def parse(self, response):
        # Yield one item per match block on the homepage
        for match in response.css('div.fixture-result-match'):
            yield {
                # [1] on the reverse `preceding` axis selects the nearest date header
                'date': match.xpath('preceding::div[@class="date-seprator"][1]/h4/text()').get(),
                'home_team': match.css('div.fixture-result-match-host span::text').get(),
                'away_team': match.css('div.fixture-result-match-guest span::text').get(),
                'score': match.css('div.fixture-result-match-goals::text').get(),
                'match_link': response.urljoin(match.css('a.fixture-result-match-teams::attr(href)').get()),
                'video_link': match.css('div.fixture-result-match-time a::attr(href)').get()
            }

## Run the spider and write the items to matches.json
## Note: process.start() can only be called once per kernel session,
## because the Twisted reactor cannot be restarted.
process = CrawlerProcess({
    'FEEDS': {
        # overwrite=True avoids appending to an existing file on a later run
        'matches.json': {'format': 'json', 'overwrite': True},
    },
})
process.crawl(Varzesh3Spider)
process.start()

2025-03-15 17:19:34 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2025-03-15 17:19:34 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.4, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Windows-10-10.0.19045-SP0
2025-03-15 17:19:34 [scrapy.crawler] INFO: Overridden settings:
{}


See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2025-03-15 17:19:34 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2025-03-15 17:19:34 [scrapy.extensions.telnet] INFO: Telnet Password: d04723a56206a094
2025-03-15 17:19:34 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 '

In [3]:
## Display JSON output
import json
with open('matches.json', 'r', encoding='utf-8') as f:
    data = json.load(f)
    print(json.dumps(data, indent=4, ensure_ascii=False))

[
    {
        "date": "پنج شنبه 23 اسفند",
        "home_team": "هوادار",
        "away_team": "شمس آذر قزوین",
        "score": null,
        "match_link": "https://www.varzesh3.com/football/match/415944/بازی-هوادار-شمس-آذر-قزوین",
        "video_link": "https://video.varzesh3.com/video/370703/خلاصه-بازی-هوادار-2-شمس-آذر-قزوین-3"
    },
    {
        "date": "پنج شنبه 23 اسفند",
        "home_team": "نساجی مازندران",
        "away_team": "سپاهان",
        "score": null,
        "match_link": "https://www.varzesh3.com/football/match/415948/بازی-نساجی-مازندران-سپاهان",
        "video_link": "https://video.varzesh3.com/video/370702/خلاصه-بازی-نساجی-مازندران-1-سپاهان-1"
    },
    {
        "date": "پنج شنبه 23 اسفند",
        "home_team": "خیبر خرم آباد",
        "away_team": "چادرملو اردکان",
        "score": null,
        "match_link": "https://www.varzesh3.com/football/match/415943/بازی-خیبر-خرم-آباد-چادرملو-اردکان",
        "video_link": "https://video.varzesh3.com/video/370704/