# Web Scraping Basics
**웹 스크래핑 기초**

**Duration (수업 시간)**: 3 hours (3시간)  
**Structure (구성)**: Lecture & Lab 2 hours + Quiz 1 hour (강의 및 실습 2시간 + 퀴즈 1시간)  
**Level (수준)**: Intermediate (중급)

---

## Learning Objectives (학습 목표)

By the end of this lesson, students will be able to:
이 수업을 마친 후 학생들은 다음을 할 수 있습니다:

- Understand what web scraping is and when to use it (웹 스크래핑이 무엇이고 언제 사용하는지 이해)
- Use the requests library to get web page content (requests 라이브러리를 사용하여 웹 페이지 내용 가져오기)
- Parse HTML using BeautifulSoup (BeautifulSoup을 사용하여 HTML 파싱)
- Extract specific information from web pages (웹 페이지에서 특정 정보 추출)
- Understand web scraping ethics and legal considerations (웹 스크래핑 윤리와 법적 고려사항 이해)

---

## 1. What is Web Scraping? (웹 스크래핑이란?)

Web scraping is like copying information from websites automatically. Instead of manually copying and pasting, you write code to do it for you.
웹 스크래핑은 웹사이트에서 정보를 자동으로 복사하는 것과 같습니다. 수동으로 복사하고 붙여넣기 하는 대신, 코드를 작성해서 대신 해주는 것입니다.

### Why Use Web Scraping? (왜 웹 스크래핑을 사용하나요?)

- **Save time**: Get lots of data quickly (시간 절약: 많은 데이터를 빠르게 얻기)
- **Get current data**: Always get the latest information (최신 데이터 얻기: 항상 최신 정보 얻기)
- **Analyze trends**: Compare prices, monitor changes (트렌드 분석: 가격 비교, 변화 모니터링)

### Simple Example (간단한 예시)

Think of web scraping like having a robot assistant that:
웹 스크래핑을 다음과 같은 로봇 어시스턴트가 있는 것처럼 생각해보세요:
- Visits websites for you (당신을 위해 웹사이트 방문)
- Reads the content (내용 읽기)
- Copies important information (중요한 정보 복사)
- Saves it in a file (파일에 저장)

---

## 2. Introduction to Requests Library (Requests 라이브러리 소개)

The **requests** library helps you get web pages from the internet. It's like a web browser, but for your Python code.
**requests** 라이브러리는 인터넷에서 웹 페이지를 가져오는 데 도움을 줍니다. 웹 브라우저와 같지만 파이썬 코드용입니다.

### Installing Requests (Requests 설치)

In [None]:
%%bash
pip install requests

### Basic Requests Usage (기본 Requests 사용법)

In [None]:
import requests

# Get a web page
response = requests.get('https://httpbin.org/html')

# Check if request was successful
print(f"Status code: {response.status_code}")

# Get the content
content = response.text
print("Page content:")
print(content[:200])  # Show first 200 characters

---

## 3. HTTP Requests and Responses (HTTP 요청과 응답)

When you visit a website, your browser sends a **request** and gets a **response**. Web scraping works the same way.
웹사이트를 방문할 때, 브라우저는 **요청**을 보내고 **응답**을 받습니다. 웹 스크래핑도 같은 방식으로 작동합니다.

### Understanding Status Codes (상태 코드 이해)

- **200**: Success! Page found (성공! 페이지 찾음)
- **404**: Page not found (페이지를 찾을 수 없음)
- **403**: Access denied (접근 거부)
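
The codes above belong to broader families: 2xx means success, 3xx redirect, 4xx client error, and 5xx server error. As a rough sketch, the hypothetical helper `describe_status` below uses Python's standard `http.HTTPStatus` enum to turn a numeric code into a readable label. The helper name and the label format are assumptions made for illustration, not part of requests.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a readable label such as '404 Not Found (client error)'."""
    # HTTPStatus knows the standard reason phrase for registered codes
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"

    # The first digit of the code tells you the general category
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"

    return f"{code} {phrase} ({category})"


for code in (200, 403, 404):
    print(describe_status(code))
# 200 OK (success)
# 403 Forbidden (client error)
# 404 Not Found (client error)
```

In a scraper you could pass `response.status_code` to a helper like this when logging failed requests.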

### Simple Status Check (간단한 상태 확인)

In [None]:
import requests

def check_website(url):
    try:
        # requests has no default timeout, so set one to avoid hanging forever
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            print(f"Success! Got {len(response.text)} characters")
        else:
            print(f"Error: Status code {response.status_code}")
    except requests.exceptions.RequestException:
        print("Could not connect to website")

# Test with a simple website
check_website('https://httpbin.org/html')

---

## 4. Introduction to BeautifulSoup (BeautifulSoup 소개)

**BeautifulSoup** helps you find specific information in HTML code. It's like having a smart search tool for web pages.
**BeautifulSoup**은 HTML 코드에서 특정 정보를 찾는 데 도움을 줍니다. 웹 페이지를 위한 스마트 검색 도구가 있는 것과 같습니다.

### Installing BeautifulSoup (BeautifulSoup 설치)

In [None]:
%%bash
pip install beautifulsoup4

### Basic BeautifulSoup Usage (기본 BeautifulSoup 사용법)

In [None]:
import requests
from bs4 import BeautifulSoup

# Get a web page
response = requests.get('https://httpbin.org/html')
html_content = response.text

# Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

# Find the title
title = soup.find('title')
if title:
    print(f"Page title: {title.text}")

# Find all headings
headings = soup.find_all('h1')
for heading in headings:
    print(f"Heading: {heading.text}")

---

## 5. HTML Parsing and Element Selection (HTML 파싱 및 요소 선택)

HTML has tags like `<title>`, `<h1>`, `<p>` that organize content. BeautifulSoup helps you find these tags.
HTML에는 내용을 구성하는 `<title>`, `<h1>`, `<p>`와 같은 태그가 있습니다. BeautifulSoup은 이러한 태그를 찾는 데 도움을 줍니다.

### Common HTML Tags (일반적인 HTML 태그)

- `<title>`: Page title (페이지 제목)
- `<h1>`, `<h2>`: Headings (제목들)
- `<p>`: Paragraphs (단락)
- `<a>`: Links (링크)
- `<img>`: Images (이미지)

### Finding Elements (요소 찾기)

In [None]:
from bs4 import BeautifulSoup

# Sample HTML
html = """
<html>
<head><title>Sample Page</title></head>
<body>
    <h1>Welcome</h1>
    <p>This is a paragraph.</p>
    <p>This is another paragraph.</p>
    <a href="https://example.com">Click here</a>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Find single element
title = soup.find('title').text
print(f"Title: {title}")

# Find all paragraphs
paragraphs = soup.find_all('p')
for i, p in enumerate(paragraphs, 1):
    print(f"Paragraph {i}: {p.text}")

# Find link
link = soup.find('a')
print(f"Link text: {link.text}")
print(f"Link URL: {link.get('href')}")
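
Besides `find` and `find_all`, BeautifulSoup also supports CSS selectors through `select` and `select_one`. The sketch below assumes a small made-up HTML snippet: the `intro` class, `logo.png`, and the `alt` text are illustrative and do not come from a real site. It shows how to select by class and how to read attributes from `<a>` and `<img>` tags.

```python
from bs4 import BeautifulSoup

# Made-up sample HTML for illustration
html = """
<html>
<body>
    <h1>Welcome</h1>
    <p class="intro">This is the intro.</p>
    <p>This is a regular paragraph.</p>
    <a href="https://example.com">Click here</a>
    <img src="logo.png" alt="Site logo">
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match (or None if nothing matches)
intro = soup.select_one("p.intro")
print(f"Intro: {intro.get_text()}")

# select returns a list of every match; a[href] means "links that have an href"
links = [a["href"] for a in soup.select("a[href]")]
print(f"Links: {links}")

# Use .get() for attributes that might be missing
images = [(img.get("src"), img.get("alt")) for img in soup.select("img")]
print(f"Images: {images}")
```

Selectors are handy when the element you want is easier to describe by class or attribute than by tag name alone.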

---

## 6. Web Scraping Ethics and Legal Considerations (웹 스크래핑 윤리와 법적 고려사항)

### Important Rules (중요한 규칙)

1. **Check robots.txt**: See what's allowed (robots.txt 확인: 허용되는 것 확인)
2. **Don't overload servers**: Be polite, add delays (서버 과부하 금지: 예의 지키기, 지연 추가)
3. **Respect copyright**: Don't steal content (저작권 존중: 콘텐츠 도용 금지)
4. **Read terms of service**: Follow website rules (서비스 약관 읽기: 웹사이트 규칙 따르기)

### Best Practices (모범 사례)

In [None]:
import requests
import time

def polite_scraping(url):
    # Add delay to be polite
    time.sleep(1)
    
    # Use headers to identify yourself
    headers = {
        'User-Agent': 'Educational Bot 1.0'
    }
    
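    try:
        # A timeout keeps the call from hanging if the server never responds
        response = requests.get(url, headers=headers, timeout=10)
        return response
    except requests.exceptions.RequestException:
        # Catch only request-related errors instead of using a bare except
        print("Could not access website")
        return None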
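
`polite_scraping` sleeps for a full second before every call, including the first one. One possible refinement, sketched below, is to wait only for the time still remaining since the previous request. The `RequestThrottle` class and the URLs are hypothetical names made up for this example, and the loop only prints instead of contacting a server.

```python
import time


class RequestThrottle:
    """Make sure at least `min_interval` seconds pass between requests."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last_request = None  # time of the previous request, if any

    def wait(self):
        """Sleep only as long as needed, then record the current time."""
        if self._last_request is not None:
            elapsed = time.monotonic() - self._last_request
            remaining = self.min_interval - elapsed
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()


throttle = RequestThrottle(min_interval=0.2)

# Hypothetical URLs; a real scraper would call requests.get() after wait()
for url in ["https://example.com/page1", "https://example.com/page2"]:
    throttle.wait()
    print(f"Would fetch {url} now")
```

The first call returns immediately, and later calls pause only if the previous request was too recent.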

---

## Lab Exercises (실습)

### Lab 1: Simple Web Page Content Extraction (간단한 웹페이지 내용 추출)

**Problem**: Extract the title and main heading from a simple web page.
문제: 간단한 웹 페이지에서 제목과 주요 헤딩을 추출하세요.

**Solution**:

In [None]:
import requests
from bs4 import BeautifulSoup

def extract_basic_info(url):
    try:
        # Get the web page
        # Set a timeout so the request cannot hang forever
        response = requests.get(url, timeout=10)
        
        if response.status_code == 200:
            # Parse HTML
            soup = BeautifulSoup(response.text, 'html.parser')
            
            # Extract title
            title_tag = soup.find('title')
            title = title_tag.text if title_tag else "No title found"
            
            # Extract main heading
            h1_tag = soup.find('h1')
            heading = h1_tag.text if h1_tag else "No heading found"
            
            # Display results
            print(f"Website: {url}")
            print(f"Title: {title}")
            print(f"Main Heading: {heading}")
            print("-" * 50)
            
        else:
            print(f"Error: Could not access {url} (status code {response.status_code})")
            
    except Exception as e:
        print(f"Error occurred: {e}")

# Test with a simple website
extract_basic_info('https://httpbin.org/html')

# You can test with other simple websites
# extract_basic_info('https://example.com')

### Lab 2: News Headline Collector (뉴스 제목 수집기)

**Problem**: Create a simple headline collector from a basic HTML structure.
문제: 기본 HTML 구조에서 간단한 헤드라인 수집기를 만드세요.

**Solution**:

In [None]:
from bs4 import BeautifulSoup

def collect_headlines(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Find all heading tags (h1, h2, h3)
    headlines = []
    
    for tag in ['h1', 'h2', 'h3']:
        elements = soup.find_all(tag)
        for element in elements:
            headlines.append(element.text.strip())
    
    return headlines

def save_headlines_to_file(headlines, filename):
    with open(filename, 'w', encoding='utf-8') as file:
        file.write("Collected Headlines\n")
        file.write("=" * 20 + "\n\n")
        
        for i, headline in enumerate(headlines, 1):
            file.write(f"{i}. {headline}\n")
    
    print(f"Headlines saved to {filename}")

# Sample HTML with news-like structure
sample_html = """
<html>
<head><title>News Site</title></head>
<body>
    <h1>Breaking News</h1>
    <h2>Technology Update</h2>
    <h2>Sports Results</h2>
    <h3>Weather Forecast</h3>
    <h3>Local Events</h3>
</body>
</html>
"""

# Extract headlines
headlines = collect_headlines(sample_html)

print("Found Headlines:")
for i, headline in enumerate(headlines, 1):
    print(f"{i}. {headline}")

# Save to file
save_headlines_to_file(headlines, 'headlines.txt')
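
The loop in `collect_headlines` gathers every `<h1>` first, then every `<h2>`, then every `<h3>`, so the result is grouped by heading level rather than listed in page order. If page order matters, one option is to pass a list of tag names to `find_all`, which returns matches in the order they appear in the document. The HTML below is a made-up sample with headings deliberately out of level order.

```python
from bs4 import BeautifulSoup

# Made-up sample where heading levels are mixed
mixed_html = """
<body>
    <h2>Technology Update</h2>
    <h1>Breaking News</h1>
    <h3>Weather Forecast</h3>
    <h2>Sports Results</h2>
</body>
"""

soup = BeautifulSoup(mixed_html, "html.parser")

# A list of names matches any of the tags, in document order
ordered = [
    (tag.name, tag.get_text(strip=True))
    for tag in soup.find_all(["h1", "h2", "h3"])
]

for level, text in ordered:
    print(f"{level}: {text}")
# h2: Technology Update
# h1: Breaking News
# h3: Weather Forecast
# h2: Sports Results
```

Keeping the tag name next to the text also lets you tell main headlines from sub-headlines later.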

### Lab 3: Product Price Information Scraper (상품 가격 정보 크롤러)

**Problem**: Extract product information from a simple HTML structure.
문제: 간단한 HTML 구조에서 제품 정보를 추출하세요.

**Solution**:

In [None]:
import csv
from bs4 import BeautifulSoup

def extract_product_info(html_content):
    soup = BeautifulSoup(html_content, 'html.parser')
    products = []

    # Look for product containers (divs with class 'product')
    product_divs = soup.find_all('div', class_='product')

    for product_div in product_divs:
        # Extract product name
        name_tag = product_div.find('h3')
        name = name_tag.text.strip() if name_tag else "Unknown"

        # Extract price
        price_tag = product_div.find('span', class_='price')
        price = price_tag.text.strip() if price_tag else "No price"

        products.append({'name': name, 'price': price})

    return products

def save_products_to_csv(products, filename):
    # csv.writer quotes values that contain commas (e.g. "$1,299"),
    # which plain string formatting would split into extra columns
    with open(filename, 'w', encoding='utf-8', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(["Product Name", "Price"])
        for product in products:
            writer.writerow([product['name'], product['price']])

    print(f"Product data saved to {filename}")

# Sample HTML with product structure
sample_html = """
<html>
<body>
    <div class="product">
        <h3>Laptop Computer</h3>
        <span class="price">$999</span>
    </div>
    <div class="product">
        <h3>Wireless Mouse</h3>
        <span class="price">$25</span>
    </div>
    <div class="product">
        <h3>Keyboard</h3>
        <span class="price">$75</span>
    </div>
</body>
</html>
"""

# Extract product information
products = extract_product_info(sample_html)

print("Found Products:")
for product in products:
    print(f"Name: {product['name']}, Price: {product['price']}")

# Save to CSV file
save_products_to_csv(products, 'products.csv')

print(f"\nTotal products found: {len(products)}")
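
The prices extracted above are still strings such as `$999`, so they cannot be summed or compared as numbers. The sketch below shows one way to convert them, assuming prices use a leading `$` and optional comma thousands separators. Other currencies and formats would need different handling. `parse_price` is a hypothetical helper, and the product list simply repeats the values from the sample HTML.

```python
from decimal import Decimal, InvalidOperation


def parse_price(price_text):
    """Convert text like '$1,299.50' to a Decimal, or None if it is not a price."""
    cleaned = price_text.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        # Covers placeholders such as "No price" and empty strings
        return None


# Same values as the sample HTML in this lab
products = [
    {"name": "Laptop Computer", "price": "$999"},
    {"name": "Wireless Mouse", "price": "$25"},
    {"name": "Keyboard", "price": "$75"},
]

amounts = [parse_price(p["price"]) for p in products]
total = sum(a for a in amounts if a is not None)

print(f"Total price: ${total}")  # Total price: $1099
```

`Decimal` avoids the rounding surprises that `float` can introduce when working with money.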

---

## Quiz Section (퀴즈)

### Quiz 1: Basic Web Requests

**Question**: Use the requests library to get the webpage content from 'https://httpbin.org/html' and print the status code. Also print the first 100 characters of the content.

requests 라이브러리를 사용하여 'https://httpbin.org/html'에서 웹페이지 내용을 가져오고 상태 코드를 출력하세요. 또한 내용의 첫 100자를 출력하세요.

**Write your answer here (답을 여기에 작성하세요)**:

In [None]:
# Your code here

### Quiz 2: HTML Parsing with BeautifulSoup

**Question**: Given the HTML below, use BeautifulSoup to extract and print all title tags. Also extract all paragraph texts.

아래 주어진 HTML에서 BeautifulSoup을 사용하여 모든 title 태그를 추출하고 출력하세요. 또한 모든 단락 텍스트도 추출하세요.

In [None]:
html_content = """
<html>
<head><title>Test Page</title></head>
<body>
    <title>Another Title</title>
    <p>First paragraph</p>
    <p>Second paragraph</p>
</body>
</html>
"""

**Write your answer here (답을 여기에 작성하세요)**:

In [None]:
# Your code here

### Quiz 3: News Headlines Scraper

**Question**: Write a program that extracts all h1 and h2 headings from HTML content and saves them to a text file called 'news_headlines.txt'. Use the sample HTML provided.

HTML 내용에서 모든 h1과 h2 헤딩을 추출하여 'news_headlines.txt'라는 텍스트 파일에 저장하는 프로그램을 작성하세요. 제공된 샘플 HTML을 사용하세요.

In [None]:
sample_html = """
<html>
<body>
    <h1>Main News Story</h1>
    <h2>Technology News</h2>
    <h2>Sports Update</h2>
    <p>Some content here</p>
    <h1>Breaking News</h1>
</body>
</html>
"""

**Write your answer here (답을 여기에 작성하세요)**:

In [None]:
# Your code here

---

## References (참고)

1. **Requests Documentation**: https://requests.readthedocs.io/en/latest/
2. **BeautifulSoup Tutorial**: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
3. **Web Scraping Ethics Guide**: https://blog.apify.com/web-scraping-ethics/
4. **Real Python Web Scraping**: https://realpython.com/beautiful-soup-web-scraper-python/

---

## Key Points (핵심 포인트)

### Remember (기억하세요)
1. **Always check status code**: 200 means success (항상 상태 코드 확인: 200은 성공 의미)
2. **Be respectful**: Don't overload websites (예의 지키기: 웹사이트에 과부하 주지 말기)
3. **Handle errors**: Use try-except for safety (오류 처리: 안전을 위해 try-except 사용)
4. **Follow the law**: Respect robots.txt and terms of service (법 준수: robots.txt와 서비스 약관 존중)

### Safety Tips (안전 팁)
- Start with simple, educational websites (간단한 교육용 웹사이트부터 시작)
- Always add delays between requests (요청 사이에 항상 지연 추가)
- Don't scrape personal or sensitive data (개인적이거나 민감한 데이터 스크래핑 금지)
- Test your code with small examples first (먼저 작은 예제로 코드 테스트)

### Next Week Preview (다음 주 미리보기)
Next week: **API Basics and Weather Data** - Getting data through official APIs
다음 주: **API 기초 및 날씨 데이터** - 공식 API를 통한 데이터 가져오기

---

## Homework (숙제)

1. Complete all three lab exercises (3개 실습 모두 완료)
2. Practice with the sample HTML provided in exercises (실습에서 제공된 샘플 HTML로 연습)
3. Research robots.txt for a website you're interested in (관심 있는 웹사이트의 robots.txt 조사)
4. Try extracting different HTML elements (div, span, etc.) (다른 HTML 요소들 추출 시도)

**Web scraping is powerful, but use it responsibly and ethically!**
**웹 스크래핑은 강력하지만 책임감 있고 윤리적으로 사용하세요!**