# 02_Website_Scraping_Playwright

This notebook shows how to scrape dynamic website content with Playwright, including authenticated sessions on LinkedIn and Medium. It covers installing the required packages, logging in to each site, scraping the rendered pages, converting the scraped content into vector embeddings with LangChain, and producing a markdown summary from the stored embeddings.

## Step 1: Install Necessary Packages

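First, install the required packages: Playwright for browser automation, LangChain for the embedding and summarization steps, and the `openai`, `chromadb`, and `tiktoken` packages that the LangChain OpenAI and Chroma integrations rely on. Playwright also needs a browser binary, which the second command downloads.

In [1]:
%pip install -q playwright langchain openai chromadb tiktoken
!playwright install chromium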

Note: you may need to restart the kernel to use updated packages.
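
After restarting the kernel, it can help to confirm that the packages are importable before going further. The following is a minimal sketch; the helper name `missing_packages` and the list of import names are assumptions made for illustration, not part of the original notebook.

```python
import importlib.util

def missing_packages(names):
    """Return the names in `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level import names assumed for this notebook (hypothetical list).
required = ["playwright", "langchain"]
print("Missing packages:", missing_packages(required))
```

An empty list means every listed package was found. Any name that is printed needs to be installed before the import cell will run.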


## Step 2: Import Packages

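Next, import the packages used in the rest of the notebook. The import paths below follow the older LangChain package layout. Recent releases move the OpenAI and Chroma integrations into separate packages such as `langchain-openai` and `langchain-community`, so adjust the paths if these imports fail.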

In [2]:
from playwright.sync_api import sync_playwright
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.text_splitter import CharacterTextSplitter
from langchain.chains import LLMChain
from langchain.llms import OpenAI
import os
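
The later cells read credentials from environment variables, so a quick check at this point avoids a failure halfway through a browser session. This is a sketch under stated assumptions: the helper `missing_env_vars` is hypothetical, and `OPENAI_API_KEY` is assumed to be the variable from which the OpenAI integrations read their key.

```python
import os

def missing_env_vars(names, env=None):
    """Return the variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Variables this notebook is assumed to need (adjust to your setup).
required_vars = ["OPENAI_API_KEY", "LINKEDIN_USERNAME", "LINKEDIN_PASSWORD", "MEDIUM_USERNAME"]
missing = missing_env_vars(required_vars)
if missing:
    print("Set these environment variables before continuing:", ", ".join(missing))
```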

## Step 3: Configure Playwright and Handle Authentication

The helpers below log in to each site on an existing Playwright page. LinkedIn uses a username and password form. Medium emails a sign-in link instead, so that login requires a manual step.

In [3]:
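def login_linkedin(page, username, password):
    """Log in to LinkedIn through the standard login form."""
    page.goto('https://www.linkedin.com/login')
    page.fill('input[name="session_key"]', username)
    page.fill('input[name="session_password"]', password)
    page.click('button[type="submit"]')
    # Wait for the post-login page to load before scraping
    page.wait_for_load_state()

def login_medium(page, username, password=None):
    """Start Medium's email sign-in; `password` is unused and kept for a uniform signature."""
    page.goto('https://medium.com/m/signin')
    page.fill('input[name="email"]', username)
    page.click('button[type="submit"]')
    # Medium sends a login link to the email, so manual intervention is needed here.
    # input() pauses the cell until the link has been opened.
    input("Check your email, open the login link sent by Medium in this browser, then press Enter to continue.")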
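
Logging in on every run is slow and, for Medium, needs manual work each time. Playwright can write a context's cookies and local storage to a file with `storage_state` and restore them through `new_context`. The sketch below wraps those two calls; the helper names and the `auth_state.json` path are assumptions made for illustration.

```python
import os

def save_session(context, path="auth_state.json"):
    """Write the context's cookies and local storage to `path`."""
    context.storage_state(path=path)

def new_context_with_session(browser, path="auth_state.json"):
    """Create a browser context, restoring saved state if the file exists."""
    if os.path.exists(path):
        return browser.new_context(storage_state=path)
    # No saved session yet: start with a fresh, logged-out context.
    return browser.new_context()
```

After a successful login, calling `save_session(page.context)` stores the session. On later runs, create pages from `new_context_with_session(browser)` instead of calling `browser.new_page()` directly. The saved file contains session cookies, so treat it like a password.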

## Step 4: Scrape Dynamic Website Content

Each function below navigates to a URL and returns the HTML of the rendered page, which includes content that JavaScript inserted after the initial load.

In [4]:
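def scrape_linkedin_profile(page, profile_url):
    """Return the rendered HTML of a LinkedIn profile page."""
    page.goto(profile_url)
    # page.content() serializes the current DOM, including JavaScript-rendered elements
    content = page.content()
    return content

def scrape_medium_article(page, article_url):
    """Return the rendered HTML of a Medium article page."""
    page.goto(article_url)
    content = page.content()
    return content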
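
The scraping functions return raw HTML, which carries markup, scripts, and styles that add noise to embeddings. One option is to reduce the HTML to visible text first. The sketch below uses only the standard library; `html_to_text` is a hypothetical helper, and a dedicated HTML parsing library may handle messy pages better.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script>, <style>, and <noscript> content."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped element.
        if text and self._skip_depth == 0:
            self.parts.append(text)

def html_to_text(html):
    """Return the visible text of `html`, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```

For example, `html_to_text(scrape_medium_article(page, article_url))` would give plain text that can be passed on to the embedding step.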

## Step 5: Convert Scraped Content into Vector Embeddings

We split the scraped content into chunks of roughly 1,000 characters, embed each chunk with OpenAI embeddings through LangChain, and store the resulting vectors in a Chroma vector store.

In [5]:
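def convert_to_embeddings(content):
    """Split `content` into chunks, embed them, and return a Chroma vector store."""
    # Split on blank lines into chunks of roughly 1000 characters, with no overlap
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_text(content)
    # OpenAIEmbeddings reads the API key from the OPENAI_API_KEY environment variable
    embeddings = OpenAIEmbeddings()
    vector_store = Chroma.from_texts(texts, embeddings)
    return vector_store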
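
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a simplified sliding-window chunker. It illustrates the idea only and is not LangChain's algorithm: `CharacterTextSplitter` splits on a separator and then merges pieces, while this hypothetical `chunk_text` cuts at fixed character offsets.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Split `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

A nonzero overlap repeats the tail of each chunk at the head of the next, which helps when a sentence would otherwise be cut across a chunk boundary.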

## Step 6: Summarize Content from Memory

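To summarize, we retrieve the most relevant chunks from the vector store and pass them to an LLM with a prompt that asks for a summary in markdown format. An `LLMChain` needs a prompt template, and the model takes text as input, not a vector store.

In [6]:
from langchain.prompts import PromptTemplate

def summarize_content(vector_store, query="main topics and key facts", k=4):
    # Retrieve the k chunks most similar to the query
    docs = vector_store.similarity_search(query, k=k)
    context = "\n\n".join(doc.page_content for doc in docs)
    prompt = PromptTemplate.from_template("Summarize the following content in markdown format:\n\n{context}")
    chain = LLMChain(llm=OpenAI(), prompt=prompt)
    summary = chain.run(context=context)
    return summary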
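
Completion models accept a limited amount of input, so it can be useful to cap how much retrieved text goes into the prompt. The sketch below assembles a summarization prompt from a list of chunks under a character budget. The helper `build_summary_prompt` and the default of 6000 characters are assumptions chosen for illustration, not limits taken from any model's documentation.

```python
def build_summary_prompt(chunks, max_chars=6000):
    """Join chunks into one prompt, stopping before `max_chars` is exceeded.

    The budget counts chunk characters only, not the separators between them.
    """
    selected, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    context = "\n\n".join(selected)
    return (
        "Summarize the following content in markdown format, "
        "using headings and bullet points where helpful.\n\n" + context
    )
```

The chunks would come from a similarity search on the vector store, for example `[doc.page_content for doc in vector_store.similarity_search(query)]`.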

## Step 7: Interactive Demonstration

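The cell below runs the full pipeline for both sites: log in, scrape, embed, and summarize. The browser is launched with `headless=False` so that you can watch the login and complete the Medium email step by hand. Credentials are read from environment variables. If the Playwright sync API raises an error about running inside an asyncio loop, which can happen in Jupyter, run the code as a script or switch to `playwright.async_api`.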

In [7]:
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()

    # LinkedIn login
    linkedin_username = os.getenv('LINKEDIN_USERNAME')
    linkedin_password = os.getenv('LINKEDIN_PASSWORD')
    login_linkedin(page, linkedin_username, linkedin_password)

    # Scrape LinkedIn profile
    linkedin_profile_url = 'https://www.linkedin.com/in/some-profile/'
    linkedin_content = scrape_linkedin_profile(page, linkedin_profile_url)

    # Convert LinkedIn content to embeddings
    linkedin_vector_store = convert_to_embeddings(linkedin_content)

    # Summarize LinkedIn content
    linkedin_summary = summarize_content(linkedin_vector_store)
    print("LinkedIn Summary:", linkedin_summary)

    # Medium login (email link sign-in; login_medium does not use the password)
    medium_username = os.getenv('MEDIUM_USERNAME')
    medium_password = os.getenv('MEDIUM_PASSWORD')
    login_medium(page, medium_username, medium_password)

    # Scrape Medium article
    medium_article_url = 'https://medium.com/some-article'
    medium_content = scrape_medium_article(page, medium_article_url)

    # Convert Medium content to embeddings
    medium_vector_store = convert_to_embeddings(medium_content)

    # Summarize Medium content
    medium_summary = summarize_content(medium_vector_store)
    print("Medium Summary:", medium_summary)

    browser.close()
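
The LinkedIn and Medium halves of the cell above repeat the same three steps. A small helper can express that sequence once. This is a sketch: `run_pipeline` is a hypothetical name, and the three callables are passed in so that the helper does not depend on any particular site.

```python
def run_pipeline(url, scrape, embed, summarize):
    """Scrape `url`, embed the result, and return its summary."""
    content = scrape(url)
    if not content:
        # Fail early so that an empty page is never embedded.
        raise ValueError(f"No content scraped from {url}")
    return summarize(embed(content))
```

Inside the `with sync_playwright()` block, the LinkedIn steps would then become `run_pipeline(linkedin_profile_url, lambda u: scrape_linkedin_profile(page, u), convert_to_embeddings, summarize_content)`.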