# ISOM 352 Applied Data Analytics with Coding
## M3.1 Scraping data from the web

In this class, we will explore web scraping: collecting data programmatically from pages on the internet.
Specifically, we will complete the following tasks:

- Scrape review data from IMDB.com
- Analyze the comments


In [9]:
# Import the libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm

plt.style.use('ggplot')

## Task 1: Set up `Selenium` for Web Scraping

### Step 0: Install the necessary library

In [2]:
# If you're using Google Colab, uncomment and run the following
# !apt-get update
# !apt install chromium-chromedriver
# !pip install selenium bs4

# If you're using VS Code locally, run the following instead
%pip install selenium 


Note: you may need to restart the kernel to use updated packages.


### Step 1: Set up the browser driver

In [10]:
from selenium import webdriver
from selenium.webdriver.common.by import By

def web_driver(headless=False):
    options = webdriver.ChromeOptions()
    options.add_argument("--disable-extensions")
    options.add_argument("--disable-infobars")
    options.add_argument("--incognito") # private mode
    options.add_argument("--no-sandbox")
    options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2}) # Disable images
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36") # Set the user agent to mimic a regular Chrome browser on Windows

    # Configure GUI 
    if headless:
        options.add_argument("--headless")  # no GUI

    # define and return a driver (A Chrome page/tab)
    return webdriver.Chrome(options=options)



In [11]:
# Step 1: Initialize driver (A Chrome page/tab)
driver = web_driver()

In [12]:
# Step 2: Access the webpage using Chrome and get the data on the page
imdb_url = 'https://www.imdb.com/title/tt0120338/reviews'
# Navigate to Titanic Review Page
driver.get(imdb_url)

In [6]:
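# Step 3: Close the driver
# Run this only after you have finished scraping: the cells below need the browser session to stay open
driver.quit()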

A webpage is structured using HTML (HyperText Markup Language), which organizes content into a hierarchical tree of elements. Here are the key components of a typical webpage structure:

**DOCTYPE Declaration:**

- Specifies the HTML version being used (e.g., `<!DOCTYPE html>` for HTML5).

**HTML Element:**

- The root element that wraps all other content (`<html>`).

**Head Section:**

- Contains meta-information about the webpage, such as the title, character set, and links to stylesheets and scripts.
- Common elements include:
    - `<title>`: Sets the title of the webpage (displayed in the browser tab).
    - `<meta>`: Provides metadata like character encoding and viewport settings.
    - `<link>`: Links to external resources like CSS files.
    - `<script>`: Links to or contains JavaScript code.

**Body Section:**

- Contains the visible content of the webpage, such as text, images, and interactive elements.
- Common elements include:
    - `<header>`: Defines the header section, often containing navigation links.
    - `<nav>`: Defines a navigation menu.
    - `<main>`: Contains the main content of the webpage.
    - `<section>`: Defines sections within the main content.
    - `<article>`: Represents independent, self-contained content.
    - `<aside>`: Contains content related to the main content, like sidebars.
    - `<footer>`: Defines the footer section, often containing contact information and links.
    - `<div>`: A generic container for grouping content.
    - `<p>`: Defines a paragraph.
    - `<h1>` to `<h6>`: Define headings, with `<h1>` being the highest level.
    - `<img>`: Embeds an image.
    - `<a>`: Defines a hyperlink.
    - `<ul>`, `<ol>`, `<li>`: Define unordered lists, ordered lists, and the items inside them.

**Attributes:**

- HTML elements can have attributes that provide additional information, such as `id`, `class`, `src`, `href`, and `style`.

**CSS and JavaScript:**

- CSS (Cascading Style Sheets) is used to style the webpage, controlling layout, colors, fonts, and more.
- JavaScript is used to add interactivity and dynamic behavior to the webpage.
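
To make the hierarchy described above concrete, here is a minimal sketch that prints the nesting of a small page using Python's built-in `html.parser`. The sample markup and the `OutlineParser` class are made up for illustration; the class name `ipc-title__text` is borrowed from the IMDB output later in this notebook.

```python
from html.parser import HTMLParser

# A tiny, made-up page that follows the structure described above
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head><title>Demo page</title></head>
  <body>
    <main>
      <article>
        <h3 class="ipc-title__text">A review title</h3>
        <p>Review text goes here.</p>
      </article>
    </main>
  </body>
</html>"""


class OutlineParser(HTMLParser):
    """Collect one indented line per opening tag to show the nesting."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, e.g. [("class", "ipc-title__text")]
        attr_text = "".join(f' {name}="{value}"' for name, value in attrs)
        self.lines.append("  " * self.depth + f"<{tag}{attr_text}>")
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes every tag is closed; void tags such as <img> are not handled here
        self.depth -= 1


parser = OutlineParser()
parser.feed(SAMPLE_HTML)
print("\n".join(parser.lines))
```

Each level of indentation in the printed outline corresponds to one level of the element tree, which is the structure that tag names and CSS selectors navigate.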

## Task 2: Scrape reviews from IMDB
We use the review page for Titanic at 'https://www.imdb.com/title/tt0120338/reviews'.

### 2.1 Scrape the titles of the reviews with tag name
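- tag: `driver.find_elements(By.TAG_NAME, 'some-tag')` returns a list of every matching element, while `driver.find_element(By.TAG_NAME, 'some-tag')` returns only the first match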

In [13]:
# Find all the titles using tag name
titles = driver.find_elements(By.TAG_NAME, 'h3')

print(f"There are {len(titles)} titles on this page")

# Inspect each title element
for title in titles:
    # Print the title text
    print(title.text)
    
    # Get and print attributes of the title element
    print(f"Tag name: {title.tag_name}")
    print(f"Class attribute: {title.get_attribute('class')}")
    
    print("-" * 50)  # Separator for readability



There are 29 titles on this page
In retrospect, we were all too hard on this film
Tag name: h3
Class attribute: ipc-title__text
--------------------------------------------------
One hell of a movie
Tag name: h3
Class attribute: ipc-title__text
--------------------------------------------------
Amazing masterpiece
Tag name: h3
Class attribute: ipc-title__text
--------------------------------------------------
Amazing in 1997, 2005, 2015, 2030, 3010 & forever more a Masterpiece!
Tag name: h3
Class attribute: ipc-title__text
--------------------------------------------------
How many times I watch this movie.... It's still the masterpiece.
Tag name: h3
Class attribute: ipc-title__text
--------------------------------------------------
This was one of the few movies that actually brought me to tears.
Tag name: h3
Class attribute: ipc-title__text
--------------------------------------------------
Titanic masterpiece: an emotional and visual thrill ride
Tag name: h3
Class attribute: ipc-tit

### 2.2 CSS (Cascading Style Sheets)

- CSS is a styling language used to describe the presentation of HTML documents
- It controls the visual appearance of web elements (color, size, font, spacing, etc.)
- CSS uses selectors to target HTML elements and apply style rules to them
- CSS selectors can target elements based on their `attributes`, `classes`, `IDs`, or `position` in the document


#### Search by combining elements and CSS

Example: `driver.find_elements(By.CSS_SELECTOR, 'h3.ipc-title__text')`

Common CSS selector combinations in Selenium:

1. Tag with class:
- `h3.ipc-title__text`

2. Tag with ID:
- `h3#review-title`

3. Tag with attribute:
- `h3[data-testid="review-title"]`
- `a[href="https://example.com"]`
- `h3[id="review-title"]`
- `h3[class="ipc-title__text"]`

4. Tag with multiple attributes:
- `h3[class="ipc-title__text"][data-testid="review-title"]`

5. Attribute value matching:
- Exact match: `h3[class="ipc-title__text"]`
- Contains: `h3[class*="title"]`
- Starts with: `h3[class^="ipc"]`
- Ends with: `h3[class$="text"]`
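
The attribute-matching operators listed above behave like simple string tests on the attribute value. The sketch below mimics them in plain Python so the differences are easy to see; `attr_matches` and `has_class` are illustrative helpers, not part of Selenium, and the second class name `ipc-title--bold` is invented for the example.

```python
def attr_matches(value, operator, target):
    """Mimic the CSS attribute operators =, *=, ^= and $= on one attribute value."""
    if operator == "=":
        return value == target                            # exact match on the whole value
    if operator == "*=":
        return bool(target) and target in value           # contains
    if operator == "^=":
        return bool(target) and value.startswith(target)  # starts with
    if operator == "$=":
        return bool(target) and value.endswith(target)    # ends with
    raise ValueError(f"Unsupported operator: {operator}")


def has_class(value, name):
    """Mimic the dot selector (h3.name): match one whitespace-separated class token."""
    return name in value.split()


single = "ipc-title__text"
double = "ipc-title__text ipc-title--bold"  # hypothetical element with two classes

print(attr_matches(single, "=", "ipc-title__text"))  # exact match on a single class
print(attr_matches(double, "=", "ipc-title__text"))  # exact match fails with two classes
print(has_class(double, "ipc-title__text"))          # the dot selector still matches
print(attr_matches(single, "*=", "title"))           # contains
print(attr_matches(single, "^=", "ipc"))             # starts with
print(attr_matches(single, "$=", "text"))            # ends with
```

One practical consequence: `h3.ipc-title__text` keeps matching if the site adds a second class to the element, whereas `h3[class="ipc-title__text"]` requires the whole attribute value to be identical.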



In [14]:
# Find all the titles using CSS selector
titles = driver.find_elements(By.CSS_SELECTOR, 'h3')

print(f"There are {len(titles)} titles on this page")

There are 29 titles on this page


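### 2.3 Leverage HTML structure
Search by combining parent and child (or descendant) elements.
Examples:
- direct child: `a > h3.ipc-title__text` matches an `h3` that sits immediately inside an `a`
- descendant: `a h3.ipc-title__text` matches an `h3` nested anywhere inside an `a`, at any depth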
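
The difference between the two combinators can be tried offline. The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in, since its limited XPath support has the same two notions: `.//a/h3` plays the role of `a > h3` and `.//a//h3` plays the role of `a h3`. The markup is a made-up fragment, not IMDB's actual page.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: one <h3> directly inside a link, one nested a level deeper
SAMPLE = """
<article>
  <a><h3>Direct title</h3></a>
  <a><div><h3>Nested title</h3></div></a>
</article>
"""

root = ET.fromstring(SAMPLE)

# Equivalent in spirit to the CSS selector 'a > h3' (direct children only)
direct = [h3.text for h3 in root.findall(".//a/h3")]

# Equivalent in spirit to the CSS selector 'a h3' (descendants at any depth)
anywhere = [h3.text for h3 in root.findall(".//a//h3")]

print(direct)
print(anywhere)
```

The child form finds only the first heading, while the descendant form finds both, which is why tightening a selector from descendant to child can reduce the number of matches.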

In [15]:
# Find only the titles that are direct children of a link inside an article
titles = driver.find_elements(By.CSS_SELECTOR, 'article a > h3')

print(f"There are {len(titles)} titles on this page")


There are 23 titles on this page


### 2.4 Scrape the reviews
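- Start with the common structure: each review sits inside its own `article` element
- Then identify the relevant elements (such as the title and the review text) within each article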

In [16]:
# First, find all the articles (each article contains a review)
articles = driver.find_elements(By.CSS_SELECTOR, 'article')
print(f"There are {len(articles)} articles on this page")

# Then, find the title and review text within each article
for article in articles:
    pass


There are 23 articles on this page
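
One possible way to fill in the loop above is sketched below. It assumes the title selector `a > h3` from section 2.3 also works inside each article, and it falls back to the article's full visible text because the selector for the review body is not shown in this notebook. `first_text`, `extract_reviews`, and the `body_selector` parameter are illustrative names, not part of Selenium.

```python
CSS = "css selector"  # the string value behind selenium's By.CSS_SELECTOR


def first_text(element, selector):
    """Return the text of the first match inside element, or None if nothing matches."""
    matches = element.find_elements(CSS, selector)
    return matches[0].text if matches else None


def extract_reviews(articles, title_selector="a > h3", body_selector=None):
    """Build one dict per article with its title and text.

    body_selector is left as None here; inspect the page to find a suitable
    selector, otherwise the whole visible text of the article is used.
    """
    rows = []
    for article in articles:
        title = first_text(article, title_selector)
        if body_selector is None:
            text = article.text  # full visible text of the article
        else:
            text = first_text(article, body_selector)
        rows.append({"title": title, "text": text})
    return rows
```

With the driver still open, `reviews = extract_reviews(articles)` would return a list of dictionaries, and `pd.DataFrame(reviews)` would turn it into a table for analysis. Searching from each `article` rather than from `driver` keeps every title paired with its own review.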


## Task 3: What do we do with the review?

Leverage a state-of-the-art LLM for sentiment analysis.

In [17]:
# first install the library
%pip install openai -q

Note: you may need to restart the kernel to use updated packages.


In [18]:
# import openai library
import os
from openai import OpenAI

# Read the API key from an environment variable instead of hard-coding it;
# a secret key pasted into a notebook is exposed to anyone who sees the file
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

# Initialize an OpenAI client
llm = OpenAI(api_key=OPENAI_API_KEY)

In [19]:
# Define a function to get the completion
def get_completion(prompt, model="gpt-4o-mini", temperature=0.1, max_tokens=None, system_message=None):
    """
    Get a completion from the OpenAI API.
    
    Args:
        prompt (str): The user prompt to send to the model
        model (str): The model to use for completion (default: "gpt-4o-mini")
        temperature (float): Controls randomness in the output (0-2, lower is more deterministic)
        max_tokens (int, optional): Maximum number of tokens to generate
        system_message (str, optional): System message to set context for the conversation
        
    Returns:
        str: The model's response content
    """
    messages = []
    
    # Add system message if provided
    if system_message:
        messages.append({"role": "system", "content": system_message})
    
    # Add user message
    messages.append({"role": "user", "content": prompt})
    
    try:
        response = llm.chat.completions.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Error getting completion: {e}")
        return None

### Call the OpenAI API to programmatically access the LLM

In [None]:
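# Use the first scraped article as the review to score
# (requires the browser session from Task 2 to still be open)
review = articles[0]

prompt = f"""
Score the sentiment of the following review from 0 (very negative) to 10 (very positive).

Review: {review.text}
"""
response = get_completion(prompt)
print(response)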
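
The model replies with free text, so the score usually has to be pulled out before it can be analyzed. The following is a minimal sketch of one heuristic, assuming the first number in the reply is the score; `parse_score` is an illustrative helper, and the sample replies are made up rather than real model output.

```python
import re


def parse_score(response):
    """Return the first number in the reply as a float if it lies in 0-10, else None."""
    if not response:
        return None  # covers None (a failed API call) and empty strings
    match = re.search(r"\d+(?:\.\d+)?", response)
    if match is None:
        return None
    score = float(match.group())
    return score if 0 <= score <= 10 else None


# Made-up replies for illustration
for reply in ["Score: 7/10", "I would rate this 8.5 out of 10.", "No numeric score given", None]:
    print(repr(reply), "->", parse_score(reply))
```

A more robust alternative is to ask the model in the prompt to reply with the number only, which makes this kind of parsing less fragile.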