# Exercise 1
### Giang Nguyen - Student ID `101593014`

> A Python script that scrapes the text content of a web page from a URL

___

## Step 1: Import libraries and read the URL from user input

In [1]:
import os
import re
import requests
from bs4 import BeautifulSoup

# Example: https://jasoncmcbride.medium.com/how-to-create-a-life-you-love-0750e852475a
# Example: https://edition.cnn.com/travel/hagia-sophia-istanbul-history-secrets

url = ""
# Keep prompting until the input starts with http:// or https://
while not re.match(r"https?://", url):
    url = input("Enter web URL: ").strip()
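
The prefix check in this cell accepts any string that starts with `http://` or `https://`, including one that names no host. A stricter check could parse the input with the standard library's `urllib.parse`. The sketch below is one possible approach; the helper name `is_http_url` and the sample inputs are assumptions, not part of the original script.

```python
from urllib.parse import urlparse


def is_http_url(candidate: str) -> bool:
    """Return True if candidate is an http(s) URL that names a host."""
    try:
        parts = urlparse(candidate.strip())
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. an unclosed IPv6 bracket
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Hypothetical inputs used only to illustrate the check
for candidate in ("https://edition.cnn.com/travel", "http://", "ftp://example.com", "example.com"):
    print(candidate, is_http_url(candidate))
```

Only the first sample passes: the others lack a host, use a different scheme, or have no scheme at all.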

___
## Step 2: Load the HTML and parse it with BS4
Request errors, such as a connectivity failure, are caught and reported before the script exits.

In [2]:
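try:
    # A timeout keeps the script from hanging on an unresponsive server
    response = requests.get(url, timeout=10)
    # Treat HTTP error statuses (4xx/5xx) as failures instead of scraping the error page
    response.raise_for_status()
    html = response.text
except requests.RequestException as e:
    print("Inaccessible or invalid URL\n", e)
    raise SystemExit(1)
# Parse, then re-parse the prettified markup; prettify() leaves the contents
# of the listed whitespace-preserving tags unformatted
doc = BeautifulSoup(html, "html.parser", preserve_whitespace_tags=["div", "section", "p", "span"])
doc = BeautifulSoup(doc.prettify(), "html.parser")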
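
The download step can be isolated in a small function so that it is easier to test. The sketch below is a minimal illustration, assuming a hypothetical helper named `fetch_html` with an injectable `get` callable; `FakeResponse` and `fake_get` are stand-ins invented here so the example runs without network access. When `get` is omitted, it falls back to `requests.get`.

```python
from typing import Callable, Optional


def fetch_html(url: str, get: Optional[Callable] = None, timeout: float = 10.0) -> str:
    """Download url and return the response body as text."""
    if get is None:
        import requests  # imported lazily so a stand-in can replace it
        get = requests.get
    response = get(url, timeout=timeout)
    # With requests, this raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text


class FakeResponse:
    """Stand-in exposing the two members of a response that fetch_html uses."""
    text = "<html><title>Demo</title></html>"

    def raise_for_status(self):
        pass  # simulate a successful (2xx) response


def fake_get(url, timeout):
    return FakeResponse()


print(fetch_html("https://example.com", get=fake_get))
```

Passing a timeout and calling `raise_for_status()` means a stalled server or an error page surfaces as an exception rather than being scraped as if it were content.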

___
## Step 3: Refine HTML and text content
Sub-processes:
1. Remove irrelevant elements, e.g. `<nav>`, `<header>`
2. Read elements into lines of text
3. Stop the process if no text lines were read

In [3]:
# Drop page chrome and scripts so only the main content remains
for el in doc.find_all(["header", "script", "nav", "footer"]):
    el.decompose()

lines = doc.get_text(separator="\n", strip=True).splitlines()

if not lines:
    print("No content found")
    raise SystemExit(1)

# Join with "\n" only; text-mode file writing translates it to the platform's line ending
filecontent = "\n".join(lines)
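
The three sub-processes can also be tried without a network connection or third-party packages. The sketch below swaps BeautifulSoup for the standard library's `html.parser.HTMLParser`, so it illustrates the same idea rather than the method used in this cell. The names `TextExtractor`, `extract_lines`, and `SKIPPED_TAGS`, as well as the sample markup, are assumptions. It also assumes that every skipped element is properly closed.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = {"header", "script", "nav", "footer"}


class TextExtractor(HTMLParser):
    """Collect non-empty text chunks that lie outside the skipped elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.lines.append(text)


def extract_lines(html: str) -> list:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.lines


sample = (
    "<html><head><title>Demo</title><script>var x = 1;</script></head>"
    "<body><nav><a href='/'>Home</a></nav><p>First paragraph.</p>"
    "<footer>Contact</footer><p>Second.</p></body></html>"
)
lines = extract_lines(sample)
if not lines:
    print("No content found")
else:
    print("\n".join(lines))
```

For the sample markup, the navigation link, script body, and footer text are dropped, while the title and both paragraphs are kept.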

___
## Step 4: Save to file
The file name is taken from the text of the `<title>` tag. If a file with that name already exists, it is removed before the `write` step.

In [4]:
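# Use the page title as the file name, keeping only filesystem-safe characters;
# a page without a usable <title> yields an empty name and triggers the fallback
title = doc.title.string if doc.title and doc.title.string else ""
basename = re.sub(r"[^a-zA-Z0-9()',_\-\s]", "", title.strip()).strip()
if not basename:
    basename = "extracted_document"
# Truncate the name before appending the extension so ".txt" is never cut off
filename = basename[:150] + ".txt"
if os.path.exists(filename):
    os.remove(filename)

# The with block flushes and closes the file automatically
with open(filename, "w", encoding="utf-8") as file:
    file.write(filecontent)
print("Saved content to file: ", filename)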

Saved content to file:  Hagia Sophia Secrets of the 1,600-year-old megastructure that has survived the collapse of empires  CNN.txt
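
The file-naming rules can be collected into one function and checked against edge cases such as a missing title or a very long one. The sketch below is one possible arrangement, assuming a hypothetical helper `title_to_filename` and constant `MAX_STEM_LENGTH`. Unlike the cell above, it also collapses the runs of spaces left behind when characters are removed.

```python
import re

MAX_STEM_LENGTH = 150  # mirrors the 150-character limit used in Step 4


def title_to_filename(title, fallback="extracted_document"):
    """Build a .txt file name from a page title, which may be None."""
    # Keep letters, digits, parentheses, apostrophes, commas, underscores,
    # hyphens, and whitespace; drop everything else
    stem = re.sub(r"[^a-zA-Z0-9()',_\-\s]", "", title or "")
    # Collapse whitespace runs (an addition to the original rules)
    stem = re.sub(r"\s+", " ", stem).strip()
    # Truncate before adding the extension so ".txt" always survives
    stem = stem[:MAX_STEM_LENGTH].rstrip()
    return (stem or fallback) + ".txt"


# Hypothetical titles used only for illustration
for title in ("Hagia Sophia: Secrets | CNN", None, "???", "a" * 300):
    print(repr(title_to_filename(title)))
```

Truncating the stem rather than the finished name guarantees that the result always ends in a single `.txt` extension.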
