# Web Scraping
This notebook demonstrates web scraping with the BeautifulSoup Python package.


## Install dependencies
Install `beautifulsoup4` and the `lxml` parser.

In [3]:
!conda install -y beautifulsoup4 lxml



  conda config --add channels defaults

For more information see https://docs.conda.io/projects/conda/en/stable/user-guide/configuration/use-condarc.html

  deprecated.topic(
Retrieving notices: done


Channels:
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /Users/smore/miniconda3/envs/emat

  added / updated specs:
    - beautifulsoup4
    - lxml


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2025.9.9   |       hca03da5_0         127 KB
    certifi-2025.10.5          |  py312hca03da5_0         157 KB
    libxslt-1.1.41             |       hf4d3faa_0         240 KB
    lxml-5.3.0           

In [6]:
## Import packages
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Prepare request


In [7]:
# Polite defaults
HEADERS = {
    "User-Agent": "EMAT-Teaching/1.0 (+instructor@example.edu)"
}
SESSION = requests.Session()
SESSION.headers.update(HEADERS)

demo_url = "https://quotes.toscrape.com/"

resp = SESSION.get(demo_url, timeout=20)
resp.raise_for_status()  # stop early on HTTP errors (4xx/5xx)

print(resp.text)

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
    
    
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="auth
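
Headers set on a `requests.Session` are merged into every request sent through it. One way to confirm this without touching the network is to prepare a request and inspect it. The sketch below rebuilds the same setup under the lowercase name `session` (our choice, so the notebook's `SESSION` is left alone) and prints what would be sent.

```python
import requests

# Same header and session setup as the cell above.
HEADERS = {"User-Agent": "EMAT-Teaching/1.0 (+instructor@example.edu)"}
session = requests.Session()
session.headers.update(HEADERS)

# Build the request without sending it; prepare_request merges in the session headers.
request = requests.Request("GET", "https://quotes.toscrape.com/")
prepared = session.prepare_request(request)

print(prepared.method, prepared.url)
print(prepared.headers["User-Agent"])
```

Nothing is sent here, so this is a cheap check to run before pointing the session at a real site.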

## Use BeautifulSoup
The `BeautifulSoup` class parses the raw HTML into a navigable tree of elements.
It lets you search for tags such as `<div>`, `<p>`, or `<a>` with methods such as `.find()` (by tag name and attributes) or `.select()` (by CSS selector).
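
To make the difference between the two search styles concrete, here is a minimal sketch on a small HTML snippet. The snippet and the name `demo_soup` are made up for illustration; they are not taken from the demo site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, made up for illustration.
html = """
<div class="quote">
  <span class="text">First quote</span>
  <a class="tag" href="/tag/life/">life</a>
  <a class="tag" href="/tag/books/">books</a>
</div>
<p>A paragraph outside the quote.</p>
"""

# "html.parser" ships with Python, so this runs without extra packages.
demo_soup = BeautifulSoup(html, "html.parser")

# .find() returns the first matching tag (or None); .find_all() returns every match.
first_link = demo_soup.find("a")
all_links = demo_soup.find_all("a", class_="tag")

# .select_one() and .select() do the same job using CSS selectors.
quote_text = demo_soup.select_one("div.quote span.text")
tag_links = demo_soup.select("div.quote a.tag")

print(first_link["href"])
print([a.get_text() for a in all_links])
print(quote_text.get_text())
print(len(tag_links))
```

CSS selectors tend to be the more compact choice when the target is nested, which is why the cells below use `.select()`.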

In [9]:

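# Name the parser explicitly; omitting it triggers a warning and may pick a different parser per machine
soup = BeautifulSoup(resp.text, "lxml")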

In [11]:
# Print title
soup.title.get_text()

'Quotes to Scrape'

## Select Elements


In [12]:
rows = []
# Each quote sits in its own <div class="quote"> block
for q in soup.select("div.quote"):
    text = q.select_one("span.text")
    author = q.select_one("small.author")
    tag_nodes = q.select("div.tags a.tag")
    # select_one() returns None when nothing matches, so guard before get_text()
    rows.append({
        "quote": text.get_text() if text else None,
        "author": author.get_text() if author else None,
        "tags": [t.get_text() for t in tag_nodes],
    })
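
The `None` guards in the loop matter when a page is missing an element. The following sketch shows the same extraction on made-up markup in which the second quote has no author and no tags; `text_or_none` and `demo_rows` are hypothetical names introduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second quote has no author element and no tags.
html = """
<div class="quote">
  <span class="text">Quote one</span>
  <small class="author">Author A</small>
  <div class="tags"><a class="tag">x</a><a class="tag">y</a></div>
</div>
<div class="quote">
  <span class="text">Quote two</span>
</div>
"""


def text_or_none(node):
    """Return the stripped text of a tag, or None when the tag was not found."""
    return node.get_text(strip=True) if node is not None else None


demo_soup = BeautifulSoup(html, "html.parser")
demo_rows = [
    {
        "quote": text_or_none(q.select_one("span.text")),
        "author": text_or_none(q.select_one("small.author")),
        # select() returns an empty list when nothing matches, so no guard is needed.
        "tags": [t.get_text(strip=True) for t in q.select("div.tags a.tag")],
    }
    for q in demo_soup.select("div.quote")
]
print(demo_rows)
```

Without the guard, calling `.get_text()` on the missing author would raise an `AttributeError`.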

### Convert the rows to a pandas DataFrame


In [13]:
quotes_df = pd.DataFrame(rows)
quotes_df.head()

| | quote | author | tags |
|---|---|---|---|
| 0 | “The world as we have created it is a process ... | Albert Einstein | [change, deep-thoughts, thinking, world] |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | [inspirational, life, live, miracle, miracles] |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | [be-yourself, inspirational] |


### Save the DataFrame to a CSV file


In [14]:
quotes_df.to_csv("quotes.csv")

### Read the file back and display

In [15]:
disk_df = pd.read_csv("quotes.csv")
disk_df.head()

| | Unnamed: 0 | quote | author | tags |
|---|---|---|---|---|
| 0 | 0 | “The world as we have created it is a process ... | Albert Einstein | ['change', 'deep-thoughts', 'thinking', 'world'] |
| 1 | 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | ['abilities', 'choices'] |
| 2 | 2 | “There are only two ways to live your life. On... | Albert Einstein | ['inspirational', 'life', 'live', 'miracle', '... |
| 3 | 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | ['aliteracy', 'books', 'classic', 'humor'] |
| 4 | 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | ['be-yourself', 'inspirational'] |
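
Two things changed on the round trip: `to_csv` wrote the row index by default, so it comes back as an `Unnamed: 0` column, and the `tags` lists were stored as text, so they come back as strings rather than lists. One possible way to avoid both is sketched below on a made-up two-row frame (`demo_df`, `buffer`, and `restored` are hypothetical names), writing to an in-memory buffer instead of `quotes.csv`.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row frame with the same columns as quotes_df.
demo_df = pd.DataFrame(
    {
        "quote": ["Quote one", "Quote two"],
        "author": ["Author A", "Author B"],
        "tags": [["x", "y"], []],
    }
)

# index=False leaves the row labels out of the file,
# so no "Unnamed: 0" column appears when it is read back.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)

# CSV has no list type, so each tags cell is stored as text such as "['x', 'y']".
# ast.literal_eval turns that text back into a Python list without executing code.
restored = pd.read_csv(buffer, converters={"tags": ast.literal_eval})

print(restored.columns.tolist())
print(restored.loc[0, "tags"])
```

The same two arguments should carry over to the notebook's own calls, for example `quotes_df.to_csv("quotes.csv", index=False)`. A format with native list support, such as JSON, would be another option.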


## Read robots.txt directly

In [22]:
robots_text = SESSION.get("https://www.google.com/robots.txt", timeout=20).text
print(robots_text[:500])  # peek

User-agent: *
User-agent: Yandex
Disallow: /search
Allow: /search/about
Allow: /search/howsearchworks
Disallow: /sdch
Disallow: /groups
Disallow: /index.html?
Disallow: /?
Allow: /?hl=
Disallow: /?hl=*&
Allow: /?hl=*&gws_rd=ssl$
Disallow: /?hl=*&*&gws_rd=ssl
Allow: /?gws_rd=ssl$
Allow: /?pt1=true$
Disallow: /imgres
Disallow: /u/
Disallow: /setprefs
Disallow: /m?
Disallow: /m/
Allow:    /m/finance
Disallow: /wml?
Disallow: /wml/?
Disallow: /wml/search?
Disallow: /xhtml?
Disallow: /xhtml/?
Disallo
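
Reading the rules by eye does not scale, so it can help to let Python answer "may this user agent fetch this URL?". The sketch below uses the standard library's `urllib.robotparser` on a small made-up robots.txt; the rules and the `example.com` URLs are assumptions for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for illustration.
sample_robots = """
User-agent: *
Disallow: /private/
"""

USER_AGENT = "EMAT-Teaching/1.0 (+instructor@example.edu)"

parser = RobotFileParser()
# parse() takes an iterable of lines, so text that was already downloaded can be reused.
parser.parse(sample_robots.splitlines())

allowed = parser.can_fetch(USER_AGENT, "https://example.com/quotes/")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/page")
print(allowed, blocked)
```

The downloaded `robots_text` could be passed in the same way with `parser.parse(robots_text.splitlines())`. One caveat: as far as we know, the standard-library parser checks rules in the order they appear in the file, so for files that list a broad `Disallow` before a narrower `Allow` (as in the output above) its answer may differ from crawlers that apply longest-match precedence.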
