# HTTP communication & web scraping using Python built-in libraries

This Jupyter notebook shows how to use **Python's standard library** to perform HTTP requests and basic web scraping. It avoids third-party libraries (like `requests` or `beautifulsoup4`) by design so you can learn the low-level building blocks.

**Contents:**

1. `http.client`: low-level HTTP, GET requests with query strings (`urllib.parse`), and POST form data.
2. `urllib.request`: a higher-level convenience wrapper.
3. Cookies and sessions with `http.cookiejar` + `urllib.request`.
4. Respecting `robots.txt` with `urllib.robotparser`.
5. Parsing HTML with `html.parser`.
6. `xml.etree.ElementTree` for well-formed XML/HTML.
7. Downloading files (images, PDFs).
8. Politeness, headers, and rate limiting.

The notebook closes with a practical example and a few mini-project ideas.


## 1) `http.client` — low-level HTTP

`http.client` is a low-level module for HTTP/1.1 communication. You create a connection to a host and then send requests.

Example: perform a simple GET request to `example.com` and inspect the response.

In [1]:
import http.client

conn = http.client.HTTPSConnection('example.com')
conn.request('GET', '/')
resp = conn.getresponse()
print('Status:', resp.status)
print('Reason:', resp.reason)
headers = resp.getheaders()
print('\nHeaders:')
for k,v in headers:
    print(f"{k}: {v}")
body = resp.read(500)  # read first 500 bytes
print('\nFirst 500 bytes of body:\n')
print(body.decode('utf-8', errors='replace'))
conn.close()

Status: 200
Reason: OK

Headers:
Accept-Ranges: bytes
Content-Type: text/html
ETag: "84238dfc8092e5d9c0dac8ef93371a07:1736799080.121134"
Last-Modified: Mon, 13 Jan 2025 20:11:20 GMT
Content-Length: 1256
Cache-Control: max-age=86000
Date: Sat, 04 Oct 2025 12:56:39 GMT
Connection: keep-alive
Alt-Svc: h3=":443"; ma=93600

First 500 bytes of body:

<!doctype html>
<html>
<head>
    <title>Example Domain</title>

    <meta charset="utf-8" />
    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1" />
    <style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
    


### Building GET requests with query strings

Use `urllib.parse` to build query strings safely.

In [2]:
from urllib.parse import urlencode, urlunparse

params = {'q': 'python http client', 'page': 1}
qs = urlencode(params)
url = urlunparse(('https', 'www.example.com', '/search', '', qs, ''))
print('Full URL:', url)

Full URL: https://www.example.com/search?q=python+http+client&page=1
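
Going the other way is just as common when scraping: taking a URL apart and reading its parameters. The sketch below uses `urlsplit`, `parse_qs`, and `urlencode(..., doseq=True)` from `urllib.parse`. The example URL and the `tag` parameter are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = 'https://www.example.com/search?q=python+http+client&page=1&tag=a&tag=b'

# Split the URL into its components
parts = urlsplit(url)
print('Scheme:', parts.scheme)
print('Host:', parts.netloc)
print('Path:', parts.path)

# parse_qs always returns lists, because a key may appear more than once
query = parse_qs(parts.query)
print('q:', query['q'])
print('tag:', query['tag'])

# doseq=True expands list values into repeated key=value pairs;
# reserved characters such as '&' and spaces are escaped automatically
rebuilt = urlencode({'tag': ['a', 'b'], 'q': 'salt & pepper'}, doseq=True)
print('Rebuilt query string:', rebuilt)
```

Note that `parse_qs` decodes `+` back into spaces, so the round trip through `urlencode` is lossless for ordinary form values.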


### POST requests (form data) — `http.client` + `urllib.parse`

Encode the form fields with `urlencode` and send them with the header `Content-Type: application/x-www-form-urlencoded`.


In [3]:
import http.client
from urllib.parse import urlencode

params = {'username':'alice', 'password':'secret'}
body = urlencode(params)
headers = {'Content-Type':'application/x-www-form-urlencoded', 'User-Agent':'Python-Stdlib-Client/1.0'}

conn = http.client.HTTPSConnection('httpbin.org')
conn.request('POST', '/post', body=body, headers=headers)
resp = conn.getresponse()
print('Status:', resp.status)
print('Reason:', resp.reason)
resp_body = resp.read().decode('utf-8')
print('\nResponse body (truncated 800 chars):\n')
print(resp_body[:800])
conn.close()

Status: 200
Reason: OK

Response body (truncated 800 chars):

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "password": "secret", 
    "username": "alice"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "30", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "Python-Stdlib-Client/1.0", 
    "X-Amzn-Trace-Id": "Root=1-68e11996-488668627ca5bf9c1258b5fe"
  }, 
  "json": null, 
  "origin": "103.79.252.177", 
  "url": "https://httpbin.org/post"
}
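
Many APIs expect a JSON body rather than form fields. The following is a minimal sketch of how the same `http.client` pattern could be adapted. The helper names `build_json_request` and `post_json` are hypothetical, and the network call is left commented out so the cell runs offline.

```python
import http.client
import json

def build_json_request(payload):
    """Return (body, headers) for a JSON POST. Hypothetical helper for this sketch."""
    body = json.dumps(payload).encode('utf-8')
    headers = {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'User-Agent': 'Python-Stdlib-Client/1.0',
    }
    return body, headers

def post_json(host, path, payload, timeout=10):
    """POST payload as JSON and return (status, decoded JSON response)."""
    body, headers = build_json_request(payload)
    conn = http.client.HTTPSConnection(host, timeout=timeout)
    try:
        conn.request('POST', path, body=body, headers=headers)
        resp = conn.getresponse()
        return resp.status, json.loads(resp.read().decode('utf-8'))
    finally:
        conn.close()  # always release the socket, even on errors

body, headers = build_json_request({'username': 'alice'})
print(body)
print(headers['Content-Type'])

# Requires network access:
# status, data = post_json('httpbin.org', '/post', {'username': 'alice'})
```

With httpbin, the echoed response should carry the payload under `"json"` instead of `"form"`.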



## 2) `urllib.request` — higher-level convenience wrapper

`urllib.request` wraps `http.client` and handles many details for you. It also integrates with `http.cookiejar` for cookies.

Example: simple GET using `urllib.request.urlopen`.

In [None]:
from urllib.request import urlopen, Request

req = Request('https://example.com', headers={'User-Agent':'Python-urllib/3'})
with urlopen(req) as r:
    print('Status:', r.status)
    print('Content-Type:', r.getheader('Content-Type'))
    text = r.read(400).decode('utf-8', errors='replace')
    print('\nFirst 400 bytes:\n')
    print(text)
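
Real requests fail, so it helps to wrap `urlopen` with a timeout and explicit error handling. Below is a sketch under a few assumptions: the helper name `fetch_text` and its `(ok, text)` return shape are invented for this example, and the demo uses a `data:` URL so it runs without network access.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def fetch_text(url, timeout=10.0):
    """Return (True, text) on success or (False, error message) on failure."""
    req = Request(url, headers={'User-Agent': 'Python-urllib-example/0.1'})
    try:
        with urlopen(req, timeout=timeout) as r:
            # Use the charset declared in Content-Type, falling back to UTF-8
            charset = r.headers.get_content_charset() or 'utf-8'
            return True, r.read().decode(charset, errors='replace')
    except HTTPError as e:
        # HTTPError is a subclass of URLError, so it must be caught first
        return False, f'HTTP error: {e.code} {e.reason}'
    except URLError as e:
        # DNS failures, refused connections, unsupported schemes, ...
        return False, f'URL error: {e.reason}'

# Offline demo: urllib can open data: URLs directly
print(fetch_text('data:text/plain;charset=utf-8,hello%20world'))
# An unsupported scheme shows the URLError branch
print(fetch_text('nosuch://example'))
```

For a live page, call `fetch_text('https://example.com')` instead.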

## 3) Cookies & simple 'session' handling using `http.cookiejar` + `urllib`

In [4]:
import http.cookiejar
from urllib.request import build_opener, HTTPCookieProcessor, Request

cj = http.cookiejar.CookieJar()
opener = build_opener(HTTPCookieProcessor(cj))

# Example: set a cookie via httpbin and then access /cookies
req = Request('https://httpbin.org/cookies/set/sessioncookie/123456', headers={'User-Agent':'Python-http-cookie-example'})
with opener.open(req) as r:
    print('Set cookie status:', r.status)

req2 = Request('https://httpbin.org/cookies', headers={'User-Agent':'Python-http-cookie-example'})
with opener.open(req2) as r:
    print('\nCookies endpoint response:\n')
    print(r.read().decode('utf-8'))

print('\nStored cookies in cookie jar:')
for cookie in cj:
    print(cookie)

Set cookie status: 200

Cookies endpoint response:

{
  "cookies": {
    "sessioncookie": "123456"
  }
}


Stored cookies in cookie jar:
<Cookie sessioncookie=123456 for httpbin.org/>
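
A plain `CookieJar` lives only in memory. To keep a session across runs, `http.cookiejar.MozillaCookieJar` can save cookies to a file and load them back. The sketch below builds a cookie by hand so it runs offline; the helper `make_session_cookie` is hypothetical, and in practice the jar would be filled by passing it to `HTTPCookieProcessor`.

```python
import http.cookiejar
import os
import tempfile

def make_session_cookie(name, value, domain):
    """Build a minimal session cookie by hand (hypothetical helper for this demo)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cookies.txt')

    jar = http.cookiejar.MozillaCookieJar(path)
    jar.set_cookie(make_session_cookie('sessioncookie', '123456', 'httpbin.org'))
    # Session cookies (no expiry) are skipped unless ignore_discard=True
    jar.save(ignore_discard=True)

    # A fresh jar, as a later run of the script would create
    restored = http.cookiejar.MozillaCookieJar(path)
    restored.load(ignore_discard=True)
    restored_cookies = {c.name: c.value for c in restored}

print('Restored cookies:', restored_cookies)
```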


## 4) Respect `robots.txt`

Before scraping, check `robots.txt` to see which paths are allowed. The standard library provides `urllib.robotparser` for this.

Example: parse and check whether `/` is allowed for a user-agent.

In [5]:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
print('Can fetch / ? ', rp.can_fetch('*', 'https://example.com/'))
print('Crawl-delay (may be None):', rp.crawl_delay('*'))

Can fetch / ?  True
Crawl-delay (may be None): None
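
To see how the rules behave without touching the network, `RobotFileParser.parse` accepts the lines of a `robots.txt` file directly. The file contents and the `example-scraper` user-agent below are invented for illustration. Note that `crawl_delay` is a method that takes a user-agent string.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for demonstration
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

agent = 'example-scraper'  # hypothetical user-agent name
allowed_home = rp.can_fetch(agent, 'https://example.com/')
allowed_private = rp.can_fetch(agent, 'https://example.com/private/report.html')
delay = rp.crawl_delay(agent)  # None when no Crawl-delay rule applies

print('Can fetch / ?', allowed_home)
print('Can fetch /private/report.html ?', allowed_private)
print('Crawl-delay:', delay)
```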


## 5) Parsing HTML with the built-in `html.parser` (HTMLParser)

`HTMLParser` is an event-driven parser suitable for simple extraction tasks. For complicated extraction use a proper parser (e.g., BeautifulSoup), but `HTMLParser` is useful when you cannot install third-party libs.

Example: extract all links (`<a href="...">`).

In [6]:
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

# small sample HTML
sample = '''
<html><body>
<a href="https://example.com/about">About</a>
<a href="/local/path">Local</a>
</body></html>
'''

le = LinkExtractor()
le.feed(sample)
print('Found links:')
print(le.links)

Found links:
['https://example.com/about', '/local/path']
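
Extracted `href` values are often relative, so they usually need to be resolved against the page URL before they can be fetched. A small sketch with `urllib.parse.urljoin` follows; the `base_url` and the extra sample links are assumptions for illustration.

```python
from urllib.parse import urldefrag, urljoin, urlsplit

base_url = 'https://example.com/docs/index.html'  # assumed URL of the page being parsed
links = [
    'https://example.com/about',
    '/local/path',
    'guide.html#intro',
    'mailto:you@example.com',
]

absolute = []
for href in links:
    # Resolve against the page URL, then drop any #fragment
    full, _fragment = urldefrag(urljoin(base_url, href))
    # Keep only links that can actually be fetched over HTTP(S)
    if urlsplit(full).scheme in ('http', 'https'):
        absolute.append(full)

print('Absolute links:')
for link in absolute:
    print(link)
```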


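## 6) `xml.etree.ElementTree` for well-formed XML/HTML

`xml.etree.ElementTree` parses XML, and HTML only when it is also well-formed XML (for example XHTML); typical real-world HTML raises `ParseError`. For traversal it supports a limited subset of XPath.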

In [7]:
import xml.etree.ElementTree as ET

xml = '<root><item id="1">One</item><item id="2">Two</item></root>'
root = ET.fromstring(xml)
for it in root.findall('item'):
    print(it.get('id'), it.text)

1 One
2 Two
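
The supported XPath subset includes descendant search (`.//`) and attribute predicates (`[@attr='value']`). A short sketch on a made-up nested document:

```python
import xml.etree.ElementTree as ET

# Invented sample document with one extra level of nesting
doc = '''<catalog>
  <section name="a"><item id="1">One</item><item id="2">Two</item></section>
  <section name="b"><item id="3">Three</item></section>
</catalog>'''
root = ET.fromstring(doc)

# './/item' searches all descendants, not just direct children
all_ids = [it.get('id') for it in root.findall('.//item')]

# An attribute predicate selects a specific element
two = root.find(".//item[@id='2']")

# Predicates can be combined with child steps
in_b = [it.text for it in root.findall("./section[@name='b']/item")]

print('All ids:', all_ids)
print('Item 2 text:', two.text)
print('Items in section b:', in_b)
```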


## 7) Downloading files (images, PDFs) with `urllib.request`

Use `urlretrieve` (documented as a legacy interface) for quick downloads, or `urlopen` with a chunked write loop for large files.


In [8]:
from urllib.request import urlretrieve

# WARNING: some environments block network access, so the download call is only printed here, not run.
print('Example (disabled run by default):')
print('urlretrieve("https://example.com/image.png", "image.png")')

# Recommended pattern for streaming large files:
code = '''\
from urllib.request import urlopen, Request
req = Request('https://example.com', headers={'User-Agent': 'Python-download-example'})
with urlopen(req) as r, open('downloaded_example.html', 'wb') as out:
    chunk = r.read(8192)
    while chunk:
        out.write(chunk)
        chunk = r.read(8192)
'''
print('\nStreaming example code (save and run locally if you have internet):\n')
print(code)

Example (disabled run by default):
urlretrieve("https://example.com/image.png", "image.png")

Streaming example code (save and run locally if you have internet):

from urllib.request import urlopen, Request
req = Request('https://example.com', headers={'User-Agent': 'Python-download-example'})
with urlopen(req) as r, open('downloaded_example.html', 'wb') as out:
    chunk = r.read(8192)
    while chunk:
        out.write(chunk)
        chunk = r.read(8192)
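
The same streaming idea can be wrapped in a reusable function. The sketch below is one possible shape, not a definitive implementation: `filename_from_url` and `download` are hypothetical helpers, and `shutil.copyfileobj` replaces the manual read loop.

```python
import os
import posixpath
import shutil
from urllib.parse import unquote, urlsplit
from urllib.request import Request, urlopen

def filename_from_url(url, default='download.bin'):
    """Pick a local file name from the last URL path segment (hypothetical helper)."""
    name = posixpath.basename(unquote(urlsplit(url).path))
    # Fall back for empty or directory-like names
    return default if name in ('', '.', '..') else name

def download(url, dest_dir, timeout=30):
    """Stream url into dest_dir in chunks and return the path written."""
    dest = os.path.join(dest_dir, filename_from_url(url))
    req = Request(url, headers={'User-Agent': 'Python-download-example'})
    with urlopen(req, timeout=timeout) as r, open(dest, 'wb') as out:
        shutil.copyfileobj(r, out, 8192)  # copy in 8 KiB chunks
    return dest

print(filename_from_url('https://example.com/files/report%20v2.pdf?x=1'))
print(filename_from_url('https://example.com/'))

# Requires network access:
# path = download('https://example.com/image.png', '.')
```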



## 8) Politeness, headers, and rate limiting

- Always send a descriptive `User-Agent` and your contact information if doing large-scale scraping.
- Respect `robots.txt` and the site's `Crawl-delay` if present.
- Add delays (e.g., `time.sleep(1)`) between requests.
- Avoid hammering servers — use exponential backoff on failures.

Example polite request snippet:

In [9]:
import time
from urllib.request import Request, urlopen

req = Request('https://example.com', headers={'User-Agent':'pratham-example-scraper/0.1 (+mailto:you@example.com)'})
# Polite loop: pause between requests instead of sending them back to back
for i in range(3):
    # Local demo only, so `req` is never sent; in live scraping call urlopen(req) here inside try/except with backoff
    print('Would fetch iteration', i+1)
    time.sleep(1)  # sleep between requests


Would fetch iteration 1
Would fetch iteration 2
Would fetch iteration 3
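
For live scraping, the try/except and exponential backoff mentioned in the loop could look roughly like this. It is a sketch under stated assumptions: the function name `fetch_with_backoff`, the retry policy (retry on 429 and 5xx only), and the placeholder `User-Agent` are choices made for this example. The `opener` and `sleep` parameters are injectable so the logic can be tried offline.

```python
import time
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def fetch_with_backoff(url, retries=3, base_delay=1.0, opener=urlopen, sleep=time.sleep):
    """Fetch url, retrying transient failures with exponentially growing delays."""
    req = Request(url, headers={'User-Agent': 'example-scraper/0.1 (+mailto:you@example.com)'})
    for attempt in range(retries + 1):
        try:
            with opener(req, timeout=10) as r:
                return r.read()
        except HTTPError as e:
            # Client errors other than 429 (Too Many Requests) will not improve on retry
            if e.code < 500 and e.code != 429:
                raise
            error = e
        except URLError as e:
            error = e
        if attempt == retries:
            raise error  # out of retries: surface the last failure
        sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... with the defaults

# Requires network access:
# body = fetch_with_backoff('https://example.com')
```

Adding a small random jitter to each delay is a common refinement when many clients retry at once.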


## Practical example: extracting headings from a real page

Below is an example that fetches a page and extracts `<h1>`-`<h3>` headings using `HTMLParser`.

(If your environment blocks outgoing HTTP, the code falls back to a small inline sample page.)

In [12]:
from html.parser import HTMLParser
from urllib.request import Request, urlopen

class HeadingExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_heading = None
        self._buf = ''  # text collected for the heading currently open
        self.headings = []
    def handle_starttag(self, tag, attrs):
        if tag in ('h1','h2','h3'):
            self.in_heading = tag
            self._buf = ''
    def handle_endtag(self, tag):
        if self.in_heading == tag:
            self.headings.append((tag, self._buf.strip()))
            self.in_heading = None
    def handle_data(self, data):
        if self.in_heading:
            self._buf += data

# Try a live fetch (may fail in an offline environment); on failure, fall back to inline sample HTML
url = 'https://example.com'
req = Request(url, headers={'User-Agent':'Python-headline-extractor/0.1'})
try:
    with urlopen(req) as r:
        html = r.read().decode('utf-8', errors='replace')
except Exception as e:
    print('Live fetch failed or blocked in this environment:', e)
    html = '<html><body><h1>Sample Title</h1><h2>Subtitle</h2><p>Text</p></body></html>'

he = HeadingExtractor()
he.feed(html)
print('Headings found:')
for tag, text in he.headings:
    print(tag, '-', text)

Headings found:
h1 - Example Domain


## Mini-projects to implement next

1. **Write a polite crawler** that, given a start URL, crawls up to N pages on the same domain and saves page titles. Use `urllib`, `html.parser`, `robotparser`, and `cookiejar`.

2. **Image downloader**: write a script that downloads all images on a page into a folder, skipping duplicates.

3. **Form submitter**: simulate a login form (to a test site like httpbin) using `http.client` POST and show how cookies persist with `cookiejar`.
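
As a possible starting point for the image downloader, the sketch below covers only the extraction step: collecting `<img src>` values, resolving them against the page URL, and skipping duplicates. The class name `ImageExtractor` and the sample HTML are invented; downloading and saving are left as the exercise.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageExtractor(HTMLParser):
    """Collect unique absolute image URLs in page order (hypothetical helper)."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []
        self._seen = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'img':
            return
        src = dict(attrs).get('src')
        if not src:
            return  # <img> without a usable src attribute
        url = urljoin(self.base_url, src)
        if url not in self._seen:  # skip duplicates after resolving
            self._seen.add(url)
            self.images.append(url)

sample = '<img src="/a.png"><img src="../a.png"><img src="b.jpg"/><img alt="no source">'
extractor = ImageExtractor('https://example.com/gallery/')
extractor.feed(sample)
print(extractor.images)
```

Deduplicating after `urljoin` matters because the same file is often referenced through different relative paths.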

