# 02_Website_Scraping_Playwright

This notebook demonstrates how to scrape dynamic website content from LinkedIn and Medium, including pages that require authentication. Playwright was the first approach; several cells now use Selenium or aiohttp instead.

We cover installing the required packages, logging in and fetching pages, converting the scraped content into vector embeddings with LangChain, and summarizing the stored content in markdown format.

## Step 1: Install Necessary Packages

First, we need to install the required packages. Run the following command to install Playwright, LangChain, and other dependencies.

In [None]:
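# chromadb and openai are needed by the Chroma vector store and the OpenAI classes used in later steps
%pip install -q playwright beautifulsoup4 aiohttp langchain ollama selenium chromadb openai
# Playwright also needs browser binaries; uncomment the next line to download Chromium
# !playwright install chromium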

In [None]:
%pip install -U pip

In [None]:
# Configure logging
import logging

from utils.logger import logger
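
`utils.logger` is a project-local module that is not shown in this notebook. If it is not available, a stand-in along the following lines could be used. The function name `build_logger`, the logger name `scraper`, and the format string are assumptions for illustration, not the project's actual configuration.

```python
import logging


def build_logger(name="scraper", level=logging.INFO):
    """Return a logger that writes to stderr and is configured only once."""
    fallback = logging.getLogger(name)
    if not fallback.handlers:  # avoid duplicate handlers when the cell is re-run
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        fallback.addHandler(handler)
    fallback.setLevel(level)
    return fallback


# Stand-in for the `logger` object that later cells expect
logger = build_logger()
logger.info("Logger ready")
```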

In [None]:
# LinkedIn login endpoint
login_url = "https://www.linkedin.com/login"

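# Perform login (getpass keeps the password from being echoed into the notebook output)
from getpass import getpass

login_data = {"session_key": "pieter.kuppens@gmail.com", "session_password": getpass("Enter your LinkedIn password: ")}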

profile_url = "https://www.linkedin.com/in/pieterkuppens"
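
Hard-coding an account name in a notebook makes it awkward to share. A later cell already reads `LINKEDIN_USERNAME` and `LINKEDIN_PASSWORD` from the environment, so the same idea could be applied here. The helper below is a minimal sketch; `load_credentials` and its parameters are hypothetical names, not part of any library.

```python
import os
from getpass import getpass


def load_credentials(prefix, env=None, prompt=getpass):
    """Read <PREFIX>_USERNAME and <PREFIX>_PASSWORD, prompting for whatever is missing."""
    env = os.environ if env is None else env
    username = env.get(f"{prefix}_USERNAME") or input(f"{prefix} username: ")
    # The password prompt is injectable so it can be replaced in tests
    password = env.get(f"{prefix}_PASSWORD") or prompt(f"{prefix} password: ")
    return {"session_key": username, "session_password": password}
```

With both variables set, `login_data = load_credentials("LINKEDIN")` returns the same dictionary shape used above without prompting.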

## 1. Export Your Data First (Preferred Legal Approach)

LinkedIn allows you to download your profile data directly:

Go to LinkedIn Data Export. 
If no export is possible, go to: https://www.linkedin.com/mypreferences/d/settings/data-export-by-page-admins

Request a full export of your data. LinkedIn will email you a ZIP file containing your profile, connections, and more.
If this doesn't suffice, you can proceed with scraping as outlined below.

Note that this notebook also reads and summarizes LinkedIn articles later on.

### Check whether your use case violates LinkedIn's terms of use, especially if it goes beyond personal use.

In [None]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

# Set up WebDriver
driver = webdriver.Chrome()
driver.get("https://www.linkedin.com/login")

# Login
username = driver.find_element(By.ID, "username")
password = driver.find_element(By.ID, "password")

username.send_keys(login_data["session_key"])
password.send_keys(login_data["session_password"])
password.send_keys(Keys.RETURN)

# Wait for login to complete
time.sleep(5)

# Navigate to your profile
driver.get(profile_url)

# Get page source
profile_page = driver.page_source
driver.quit()

# Parse the HTML with BeautifulSoup
from bs4 import BeautifulSoup

soup = BeautifulSoup(profile_page, "html.parser")
print(soup.prettify())
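
The fixed `time.sleep(5)` above is either wasted time or too short on a slow connection. Selenium ships `WebDriverWait` for this purpose; the dependency-free sketch below shows the same polling idea. `wait_until` and its defaults are illustrative assumptions.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(interval)
```

A possible use is `wait_until(lambda: "/login" not in driver.current_url, timeout=15)`, assuming that a successful login navigates away from the login URL.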

In [None]:
import aiohttp
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}


async def extract_soup_from_url(url):
    async with aiohttp.ClientSession() as session:
        try:
            # async with session.post(login_url, data=login_data, headers=headers) as login_response:
            #    login_result = login_response

            async with session.get(url, headers=headers) as response:
                response.raise_for_status()
                html_content = await response.text()

            soup = BeautifulSoup(html_content, "html.parser")
        except aiohttp.ClientError as e:
            logger.error(f"Network error during scraping: {e}")
            raise
        except Exception as e:
            logger.error(f"Unexpected error in profile extraction: {e}")
            raise RuntimeError("Profile scraping failed") from e

        return soup


soup = await extract_soup_from_url(profile_url)

In [None]:
# Unfortunately, the title and h1 are easy to locate, but the headline is not.

{
    "name": soup.find("h1", class_="top-card-layout__title").text.strip(),
    "headline": soup.find("h2").text.strip(),  # class_='top-card-layout__headline').text.strip(),
    # "soup": soup
}
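
`soup.find(...)` returns `None` when an element is missing, so the chained `.text.strip()` above raises an `AttributeError` as soon as LinkedIn changes its markup or serves a login page. A defensive variant could look like the sketch below; `safe_text` and `extract_profile_fields` are hypothetical helper names, and the selectors are the ones from the cell above.

```python
def safe_text(node, default=None):
    """Return the stripped text of a parsed element, or default if it was not found."""
    return node.get_text(strip=True) if node is not None else default


def extract_profile_fields(soup):
    """Collect name and headline without raising when an element is absent."""
    return {
        "name": safe_text(soup.find("h1", class_="top-card-layout__title")),
        "headline": safe_text(soup.find("h2")),
    }
```

A missing field then shows up as `None` in the result instead of aborting the cell.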

In [None]:
linkedin_article_link = "https://www.linkedin.com/pulse/why-should-you-learn-python-2022-oliver-veits"

# This probably needs a rewrite as well: replace the aiohttp client session with the Selenium approach.
soup_article = await extract_soup_from_url(linkedin_article_link)

{"soup": soup_article}

# Note that we detect here that a login is required to access the article content.
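
Instead of inspecting the returned HTML by eye, the check for a login wall could be automated with a simple text heuristic. The sketch below is only a rough indicator, since many public pages also contain a sign-in link. The phrases "sign in" and "become a member" are the ones this notebook checks for on Medium later; "join now" and the helper name `find_login_markers` are assumptions.

```python
# Assumed phrases that hint at a login wall; adjust them per site
DEFAULT_MARKERS = ("sign in", "join now", "become a member")


def find_login_markers(html, markers=DEFAULT_MARKERS):
    """Return the markers that occur in the HTML, compared case-insensitively."""
    lowered = html.lower()
    return [marker for marker in markers if marker in lowered]
```

For example, `find_login_markers(str(soup_article))` returns the matching phrases, and an empty list suggests that the content was served without a login prompt.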

## Step 2: Import Packages

Next, we will import the necessary packages for our setup.

In [None]:
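# These import paths match older LangChain releases; newer releases move several of them into separate packages.
from langchain.embeddings import OpenAIEmbeddings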
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import LLMChain
from langchain.llms import OpenAI
import os

## Step 3: Handle Authentication

We need to handle authentication for LinkedIn and Medium.

The following code sets up the connection and handles the login process.

In [None]:
def login_linkedin(page, username, password):
    # First attempt, based on Playwright; it may need a rewrite to aiohttp.
    page.goto("https://www.linkedin.com/login")
    page.fill('input[name="session_key"]', username)
    page.fill('input[name="session_password"]', password)
    page.click('button[type="submit"]')


def login_medium(page, username, password):
    page.goto("https://medium.com/m/signin")
    page.fill('input[name="email"]', username)
    page.click('button[type="submit"]')
    # Medium sends a login link to the email, so manual intervention is needed here
    print("Please check your email and click the login link sent by Medium.")
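
Because Medium emails a login link, the login cannot be fully scripted. One option is to complete the login by hand once and reuse the resulting browser session afterwards. The sketch below uses Playwright's async API and its storage-state feature for that. The function name `fetch_with_saved_session`, the file name `medium_state.json`, and the fixed waiting period are assumptions for illustration, and the emailed link has to be opened inside the browser window that Playwright launches.

```python
import asyncio
import os


async def fetch_with_saved_session(
    url,
    state_path="medium_state.json",  # hypothetical file holding cookies and local storage
    signin_url="https://medium.com/m/signin",
    login_wait_seconds=120,
):
    """Reuse a saved browser session; on the first run, pause for a manual login."""
    # Imported lazily so the cell can be defined before Playwright is installed
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        if not os.path.exists(state_path):
            # First run: visible browser so the login can be completed by hand
            browser = await p.chromium.launch(headless=False)
            context = await browser.new_context()
            page = await context.new_page()
            await page.goto(signin_url)
            await asyncio.sleep(login_wait_seconds)  # crude pause for the manual step
            await context.storage_state(path=state_path)  # saved even if login was skipped
            await browser.close()

        # Later runs: headless browser that starts from the saved session
        browser = await p.chromium.launch(headless=True)
        try:
            context = await browser.new_context(storage_state=state_path)
            page = await context.new_page()
            await page.goto(url)
            return await page.content()
        finally:
            await browser.close()
```

In a notebook cell this would be called as `html = await fetch_with_saved_session(article_url)`. Deleting the state file forces a fresh manual login.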

## Step 4: Scrape Dynamic Website Content

We will use Playwright to scrape dynamic website content from LinkedIn and Medium.

In [None]:
def scrape_linkedin_profile(page, profile_url):
    # Old Playwright-based code
    page.goto(profile_url)
    content = page.content()
    return content


def scrape_medium_article(page, article_url):
    page.goto(article_url)
    content = page.content()
    return content
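
Both functions expect a Playwright `page` object, which none of the cells above create. Playwright's synchronous API generally cannot be used inside an already running asyncio event loop, which is the situation in a notebook, so the async API is the safer choice here. The following is a minimal sketch under that assumption; `fetch_rendered_html` and its default timeout are illustrative, not part of the original code.

```python
async def fetch_rendered_html(url, headless=True, timeout_ms=30_000):
    """Open url in Chromium and return the HTML after the DOM has loaded."""
    # Imported lazily so the cell can be defined before Playwright is installed
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=headless)
        try:
            page = await browser.new_page()
            await page.goto(url, timeout=timeout_ms, wait_until="domcontentloaded")
            return await page.content()
        finally:
            # Close the browser even when navigation fails
            await browser.close()
```

In a notebook cell this can be awaited directly, for example `html = await fetch_rendered_html(profile_url)`.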

## Step 4.1: Check if login + scraping works

We first check what happens when we open my personal LinkedIn profile without logging in.

If that fails, we log in and try again.

We then repeat this for a content page that may be less accessible than a public profile.

In [None]:
linkedin_profile = "https://www.linkedin.com/in/pieterkuppens"

In [None]:
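# NOTE: `page` must be a Playwright page object; no earlier cell creates one, so this call raises a NameError as written.
linkedin_content_no_login = scrape_linkedin_profile(page, linkedin_profile)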

## Step 5: Convert Scraped Content into Vector Embeddings

We will use LangChain to convert the scraped content into vector embeddings and store them in a Chroma vector store.

In [None]:
def convert_to_embeddings(content):
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_text(content)
    embeddings = OpenAIEmbeddings()
    vector_store = Chroma.from_texts(texts, embeddings)
    return vector_store
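
To make the `chunk_size` and `chunk_overlap` parameters concrete, the sketch below splits a string into fixed-size character windows. It is a simplified illustration and not LangChain's implementation; `CharacterTextSplitter` works differently in detail because it splits on a separator first. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not just a repeat of the overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]
```

With `chunk_overlap=0`, as in the cell above, consecutive chunks share no characters, so a sentence that straddles a boundary is cut in two. A small overlap keeps some shared context between neighbouring chunks.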

## Step 6: Summarize Content from Memory

We will use LangChain to summarize the content held in the vector store (the "memory") in markdown format.

In [None]:
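def summarize_content(vector_store, query="Summarize the main points", k=4):
    from langchain.prompts import PromptTemplate

    # Retrieve the most relevant chunks from the vector store; an LLMChain cannot consume the store itself
    docs = vector_store.similarity_search(query, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)
    # LLMChain requires a prompt; this one asks for markdown output
    prompt = PromptTemplate.from_template("Summarize the following text in markdown:\n\n{text}")
    chain = LLMChain(llm=OpenAI(), prompt=prompt)
    return chain.run(text=context)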
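
Language models accept only a limited amount of input, so the retrieved chunks usually have to be trimmed before they are placed in a prompt. The sketch below shows one simple way to do this with a character budget. `build_summary_prompt` and the default `max_chars` value are assumptions for illustration, not a LangChain API.

```python
def build_summary_prompt(chunks, max_chars=6000):
    """Join chunks into one markdown-summary prompt, stopping before max_chars is exceeded."""
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break  # keep earlier (more relevant) chunks, drop the rest
        selected.append(chunk)
        used += len(chunk)
    context = "\n\n".join(selected)
    return f"Summarize the following text as markdown bullet points:\n\n{context}"
```

The budget counts characters rather than tokens, which is only a rough proxy for the model's real limit.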

## Step 7: Interactive Demonstration

The cell below runs the entire process end to end with Selenium: log in, scrape, convert to embeddings, and summarize.

In [None]:
# Step 7: Interactive Demonstration with Selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
import os


def run_scraping_demo():
    # Set up WebDriver
    driver = webdriver.Chrome()  # Or use webdriver.Firefox() if you prefer Firefox

    try:
        # LinkedIn login and scraping
        linkedin_username = os.getenv("LINKEDIN_USERNAME", "your_email@example.com")
        linkedin_password = os.getenv("LINKEDIN_PASSWORD", "your_password")

        # If environment variables aren't set, you can use the input from earlier cells
        if not linkedin_password or linkedin_password == "your_password":
            # Use the login_data dictionary if already defined in your notebook
            try:
                linkedin_username = login_data["session_key"]
                linkedin_password = login_data["session_password"]
            except NameError:
                print("Please set your LinkedIn credentials as environment variables or in the login_data dictionary.")
                return

        # Login to LinkedIn
        linkedin_content = scrape_linkedin_with_selenium(driver, linkedin_username, linkedin_password)

        # Convert LinkedIn content to embeddings
        try:
            linkedin_vector_store = convert_to_embeddings(linkedin_content)
            linkedin_summary = summarize_content(linkedin_vector_store)
            print("LinkedIn Summary:", linkedin_summary)
        except Exception as e:
            print(f"Error processing LinkedIn content: {e}")

        # Medium login and scraping
        medium_username = os.getenv("MEDIUM_USERNAME", "your_email@example.com")
        medium_password = os.getenv("MEDIUM_PASSWORD", "your_password")

        # Scrape Medium article
        try:
            medium_article_url = "https://medium.com/some-article"
            medium_content = scrape_medium_with_selenium(driver, medium_article_url, medium_username, medium_password)

            # Convert Medium content to embeddings
            medium_vector_store = convert_to_embeddings(medium_content)
            medium_summary = summarize_content(medium_vector_store)
            print("Medium Summary:", medium_summary)
        except Exception as e:
            print(f"Error processing Medium content: {e}")

    finally:
        # Always close the driver
        driver.quit()


def scrape_linkedin_with_selenium(driver, username, password, profile_url="https://www.linkedin.com/in/pieterkuppens"):
    # Open LinkedIn login page
    driver.get("https://www.linkedin.com/login")

    # Login
    username_field = driver.find_element(By.ID, "username")
    password_field = driver.find_element(By.ID, "password")

    username_field.send_keys(username)
    password_field.send_keys(password)
    password_field.send_keys(Keys.RETURN)

    # Wait for login to complete
    time.sleep(5)

    # Navigate to profile
    driver.get(profile_url)

    # Wait for page to load
    time.sleep(3)

    # Get page source
    profile_page = driver.page_source

    # Parse with BeautifulSoup
    soup = BeautifulSoup(profile_page, "html.parser")

    # Extract profile information
    try:
        name = soup.find("h1", class_="text-heading-xlarge").text.strip()
        print(f"Successfully scraped profile for: {name}")
    except AttributeError:  # soup.find() returned None because the element is missing
        print("Could not extract name from profile")

    return profile_page


def scrape_medium_with_selenium(driver, article_url, username, password):
    # First try to access the article without login
    driver.get(article_url)
    time.sleep(3)

    # Check if login is needed (look for login wall)
    if "Sign in" in driver.page_source or "become a member" in driver.page_source.lower():
        print("Login required for Medium. Attempting to log in...")

        # Go to Medium sign-in page
        driver.get("https://medium.com/m/signin")
        time.sleep(2)

        # Click on "Sign in with email" if it exists
        try:
            email_button = driver.find_element(By.XPATH, "//button[contains(text(), 'Sign in with email')]")
            email_button.click()
            time.sleep(1)
        except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
            print("Could not find 'Sign in with email' button, proceeding...")

        # Enter email
        try:
            email_field = driver.find_element(By.NAME, "email")
            email_field.send_keys(username)
            email_field.send_keys(Keys.RETURN)
            time.sleep(2)

            print("Email submitted. Please check your email for a login link from Medium.")
            print("Medium requires manual login via email link. Cannot proceed automatically.")
            return "Medium login requires email verification. Please login manually."
        except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
            print("Could not find email field for Medium login")
            return "Medium login failed"

    # Get article content
    time.sleep(3)
    article_content = driver.page_source

    # Parse with BeautifulSoup and return the visible text rather than the raw HTML
    soup = BeautifulSoup(article_content, "html.parser")

    return soup.get_text(separator="\n", strip=True)


# Run the scraping demo
run_scraping_demo()