### Week 1 Contribution: Selenium-enhanced Website Summarizer
This notebook summarizes the content of a website using a BeautifulSoup-first strategy, with Selenium as a fallback for pages that BeautifulSoup cannot handle, such as JavaScript-heavy pages. Llama 3.2, served locally through Ollama, generates a Markdown-formatted summary.


In [0]:
import os
import requests
from bs4 import BeautifulSoup
from IPython.display import Markdown,display
from openai import OpenAI

In [0]:
MODEL="llama3.2"
# Ollama exposes an OpenAI-compatible API on localhost; the api_key is only a placeholder
openai=OpenAI(base_url="http://localhost:11434/v1",api_key="ollama")

In [0]:
message="Hi, write a snarky poem for me." 
response=openai.chat.completions.create(
    model=MODEL,
    messages=[{
        "role":"user",
        "content":message
    }]
)
print(response.choices[0].message.content)

### Beautiful Soup Version

In [0]:
headers = {
 "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
} # browser-like User-Agent so that sites are less likely to block the request as a bot

class bsWebsite:
    """
    Attributes:
        url (str): The URL of the page
        title (str): The title of the page
        text (str): The readable text from the page
    """

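    def __init__(self,url):
        self.url=url
        # fetch the page; the timeout and status check make failures raise
        # an exception so that callers can fall back to another strategy
        response=requests.get(url,headers=headers,timeout=30)
        response.raise_for_status()

        soup=BeautifulSoup(response.content,'html.parser') # parse the raw HTML into a navigable tree
        self.title=soup.title.string if soup.title and soup.title.string else "No title"

        if soup.body is None:
            raise ValueError(f"No <body> found at {url}")

        # drop elements that carry no readable text
        for irrelevant in soup.body(["script", "style", "img", "input"]):
            irrelevant.decompose()

        self.text=soup.body.get_text(separator='\n',strip=True)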


In [0]:
ed = bsWebsite("https://edwarddonner.com")

print(ed.url)
print(ed.text)
print(ed.title)

#### Now, let's create a detailed summary of how Selenium works using what we just made

In [0]:
sel=bsWebsite("https://www.geeksforgeeks.org/software-engineering/selenium-webdriver-tutorial/")
print(sel.url)
print(sel.title)

In [0]:
def user_prompt_for(web):
    user_prompt=f"""You are looking at a website called {web.title}. 
    Provide a detailed summary of the given content and the concepts in markdown:\n[{web.text}]"""

    return user_prompt
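
Language models have a finite context window, and `user_prompt_for` passes the full page text into the prompt. A minimal sketch of capping the text first is shown below. The names `truncate_text` and `user_prompt_capped` and the 8,000-character budget are hypothetical choices for illustration, not part of the notebook.

```python
def truncate_text(text, max_chars=8000):
    """Return text capped at max_chars, cutting at the last line break when possible.

    max_chars is an assumed budget; tune it for the model in use.
    """
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # prefer to cut at a line boundary so the last line is not split midway
    last_break = clipped.rfind("\n")
    if last_break > 0:
        clipped = clipped[:last_break]
    return clipped + "\n[content truncated]"


def user_prompt_capped(title, text, max_chars=8000):
    """Hypothetical variant of user_prompt_for that takes plain strings."""
    body = truncate_text(text, max_chars)
    return (
        f"You are looking at a website called {title}.\n"
        f"Provide a detailed summary of the given content and the concepts in markdown:\n[{body}]"
    )
```

With the classes above, this could be called as `user_prompt_capped(sel.title, sel.text)`.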

In [0]:
system_prompt="""You are an assistant that analyses the contents of a website based on the user's request, 
while ignoring navigation-related text. Respond in markdown."""

In [0]:
print(user_prompt_for(ed))

In [0]:
user_prompt=user_prompt_for(sel)

In [0]:
messages=[
    { "role":"system", "content":system_prompt},
    { "role":"user", "content":user_prompt}
]

In [0]:
response=openai.chat.completions.create(model=MODEL,messages=messages)

print(response.choices[0].message.content)

### Selenium Version

In [0]:
# making sure we're in the virtual environment
import sys
print(sys.executable)

In [0]:
# !pip install selenium

In [0]:
# !pip install webdriver-manager

In [0]:
from selenium import webdriver
from selenium.webdriver.common.by import By # needed for By.TAG_NAME in SelWebsite
from selenium.webdriver.edge.service import Service
# for edge only:
from webdriver_manager.microsoft import EdgeChromiumDriverManager

In [0]:
# works for edge only. Do not close the window that pops up, as it will be used to open the given sites.
driver=webdriver.Edge(service=Service(EdgeChromiumDriverManager().install()))

In [0]:
# creating a class similar to bsWebsite, but using selenium
class SelWebsite:

    def __init__(self,url,driver):
        self.driver=driver
        self.driver.get(url)
        
        self.url=self.driver.current_url
        self.title=self.driver.title
        self.text=self.driver.find_element(By.TAG_NAME,"body").text

In [0]:
# testing it on OpenAI website
gpt=SelWebsite("https://openai.com",driver)
print(gpt.url)
print(gpt.driver)
print(gpt.title)
print(gpt.text)

##### Troubleshooting in case of errors:
1. Make sure the browser window that popped up has not been closed.
2. If the cell below prints anything other than an error, the driver's session ID is valid. If problems persist anyway, quit the driver and start it again.
3. If the session ID is invalid, start the driver again using the cells below.
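
The checks in the list above can be folded into a small helper. This is only a sketch and not part of the original notebook: `driver_is_alive` is a hypothetical name, and it assumes that reading `driver.title` raises an exception once the browser window has been closed or the session has ended.

```python
def driver_is_alive(driver):
    """Return True if the driver still appears to have a usable browser session.

    Assumption: a property that talks to the browser (here, title) raises
    an exception when the window was closed or the session has ended.
    """
    if driver is None or getattr(driver, "session_id", None) is None:
        return False
    try:
        _ = driver.title  # forces a round-trip to the browser
        return True
    except Exception:
        return False
```

If `driver_is_alive(driver)` returns `False`, restart the driver with the cells below before creating another `SelWebsite`.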

In [0]:
# use the following code to check for valid session ID for driver if error occurs:
print(driver.session_id)

In [0]:
# if the session ID above is valid but errors persist, run both lines; otherwise run only the second line:
# driver.quit()
# driver = webdriver.Edge(service=Service(EdgeChromiumDriverManager().install()))

In [0]:
print(user_prompt_for(gpt))

In [0]:
messages2=[
    {"role":"system","content":system_prompt},
    {"role":"user","content":user_prompt_for(gpt)}
]

In [0]:
response=openai.chat.completions.create(model=MODEL,messages=messages2)

print(response.choices[0].message.content)

### Now let's build a summarize function which can be called directly to summarize any site.

In [0]:
def summarize(site_url):
    """
    Summarizes the visible content of a website.
    - Tries BeautifulSoup parsing first (bsWebsite)
    - Falls back to Selenium parsing (SelWebsite) if BS4 fails
    - Uses llama3.2 to generate a summary in Markdown
    """
    try:
        site=bsWebsite(site_url)
    except Exception as e:
        print(f"BS4 failed: {e}\nTrying Selenium...\n")
        site=SelWebsite(site_url,driver)

    messages3=[
        {"role":"system","content":system_prompt},
        {"role":"user","content":user_prompt_for(site)}
    ]

    print(f"\nSummarizing: {site.title}\nURL: {site.url}\n")

    response=openai.chat.completions.create(model=MODEL,messages=messages3)

    print(response.choices[0].message.content)
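
As written, `summarize` switches to Selenium only when `bsWebsite` raises an exception. A JavaScript-heavy page can return valid HTML with very little readable text, in which case no exception occurs. One possible refinement, sketched below, also falls back when the extracted text is suspiciously short. `pick_site`, `MIN_CHARS`, and the 200-character threshold are hypothetical; the two loader arguments stand in for `bsWebsite` and `SelWebsite`.

```python
MIN_CHARS = 200  # assumed threshold for "too little text"; tune per use case


def pick_site(url, static_loader, browser_loader, min_chars=MIN_CHARS):
    """Try the static loader first; use the browser loader if it fails
    or returns too little text.

    static_loader and browser_loader are callables that take a URL and
    return an object with a .text attribute.
    """
    try:
        site = static_loader(url)
        if len(site.text.strip()) >= min_chars:
            return site
        print(f"Only {len(site.text)} characters found, trying browser loader...")
    except Exception as e:
        print(f"Static loader failed: {e}\nTrying browser loader...")
    return browser_loader(url)
```

Inside `summarize`, this could replace the `try`/`except` block with `site=pick_site(site_url, bsWebsite, lambda u: SelWebsite(u, driver))`.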

In [0]:
summarize("https://www.udemy.com")